Showing 1–23 of 23 results for author: Dong, Q

Searching in archive eess.
  1. arXiv:2406.15846  [pdf, other]

    cs.CL eess.AS

    Revisiting Interpolation Augmentation for Speech-to-Text Generation

    Authors: Chen Xu, Jie Wang, Xiaoqian Liu, Qianqian Dong, Chunliang Zhang, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang

    Abstract: Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under…

    Submitted 22 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

  2. arXiv:2403.18826  [pdf]

    q-bio.QM eess.IV eess.SY

    SAM-dPCR: Real-Time and High-throughput Absolute Quantification of Biological Samples Using Zero-Shot Segment Anything Model

    Authors: Yuanyuan Wei, Shanhang Luo, Changran Xu, Yingqi Fu, Qingyue Dong, Yi Zhang, Fuyang Qu, Guangyao Cheng, Yi-Ping Ho, Ho-Pui Ho, Wu Yuan

    Abstract: Digital PCR (dPCR) has revolutionized nucleic acid diagnostics by enabling absolute quantification of rare mutations and target sequences. However, current detection methodologies face challenges, as flow cytometers are costly and complex, while fluorescence imaging methods, relying on software or manual counting, are time-consuming and prone to errors. To address these limitations, we present SAM…

    Submitted 22 January, 2024; originally announced March 2024.

    Comments: 23 pages, 6 figures

  3. arXiv:2312.13585  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech Translation with Large Language Models: An Industrial Practice

    Authors: Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

    Abstract: Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model (LLM) with a speech encoder and employing multi-task instruction tuning, LLM-ST can produce accurate timestamped transcriptions and translations, even from long au…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Technical report. 13 pages. Demo: https://speechtranslation.github.io/llm-st/

  4. arXiv:2309.12234  [pdf, ps, other]

    cs.CL eess.AS

    Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

    Authors: Chen Xu, Xiaoqian Liu, Erfeng He, Yuhao Zhang, Qianqian Dong, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang

    Abstract: In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Build…

    Submitted 21 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2309.10153  [pdf, other]

    eess.IV cs.CV cs.LG

    Preserving Tumor Volumes for Unsupervised Medical Image Registration

    Authors: Qihua Dong, Hao Du, Ying Song, Yan Xu, Jing Liao

    Abstract: Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and und…

    Submitted 9 May, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: ICCV 2023 Poster

  6. arXiv:2306.11646  [pdf, other]

    cs.CL eess.AS

    Recent Advances in Direct Speech-to-text Translation

    Authors: Chen Xu, Rong Ye, Qianqian Dong, Chengqi Zhao, Tom Ko, Mingxuan Wang, Tong Xiao, Jingbo Zhu

    Abstract: Recently, speech-to-text translation has attracted more and more attention and many studies have emerged rapidly. In this paper, we present a comprehensive survey on direct speech translation aiming to summarize the current state-of-the-art techniques. First, we categorize the existing research work into three directions based on the main challenges -- modeling burden, data scarcity, and applicati…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: An expanded version of the paper accepted by IJCAI2023 survey track

  7. arXiv:2306.10493  [pdf, other]

    cs.SD cs.CL eess.AS

    MOSPC: MOS Prediction Based on Pairwise Comparison

    Authors: Kexin Wang, Yunlong Zhao, Qianqian Dong, Tom Ko, Mingxuan Wang

    Abstract: As a subjective metric to evaluate the quality of synthesized speech, mean opinion score (MOS) usually requires multiple annotators to score the same speech. Such an annotation approach requires a lot of manpower and is also time-consuming. MOS prediction model for automatic evaluation can significantly reduce labor cost. In previous works, it is difficult to accurately rank the quality of speech…

    Submitted 18 June, 2023; originally announced June 2023.

  8. arXiv:2306.02982  [pdf, other]

    cs.CL eess.AS

    PolyVoice: Language Models for Speech to Speech Translation

    Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

    Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt…

    Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

  9. arXiv:2302.02125  [pdf, other]

    eess.IV cs.CV cs.LG

    Weakly-Supervised 3D Medical Image Segmentation using Geometric Prior and Contrastive Similarity

    Authors: Hao Du, Qihua Dong, Yan Xu, Jing Liao

    Abstract: Medical image segmentation is almost the most important pre-processing procedure in computer-aided diagnosis but is also a very challenging task due to the complex shapes of segments and various artifacts caused by medical imaging, (i.e., low-contrast tissues, and non-homogenous textures). In this paper, we propose a simple yet effective segmentation framework that incorporates the geometric prior…

    Submitted 4 February, 2023; originally announced February 2023.

    Comments: Weakly-supervised Segmentation, Medical Image Segmentation, Contrastive Similarity, Geometric Prior, Point Cloud

    Journal ref: IEEE Trans. Med. Imaging, Early Access, pp. 1-1, April 24, 2023

  10. arXiv:2212.03657  [pdf, other]

    cs.CL cs.SD eess.AS

    M3ST: Mix at Three Levels for Speech Translation

    Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

    Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  11. arXiv:2205.08993  [pdf, other]

    cs.CL eess.AS

    Leveraging Pseudo-labeled Data to Improve Direct Speech-to-Speech Translation

    Authors: Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Qibing Bai, Yu Zhang

    Abstract: Direct Speech-to-speech translation (S2ST) has drawn more and more attention recently. The task is very challenging due to data scarcity and complex speech-to-speech mapping. In this paper, we report our recent achievements in S2ST. Firstly, we build a S2ST Transformer baseline which outperforms the original Translatotron. Secondly, we utilize the external data by pseudo-labeling and obtain a new…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022

  12. arXiv:2201.03313  [pdf, other]

    eess.AS cs.AI cs.SD

    Cross-Modal ASR Post-Processing System for Error Correction and Utterance Rejection

    Authors: Jing Du, Shiliang Pu, Qinbo Dong, Chao Jin, Xin Qi, Dian Gu, Ru Wu, Hongwei Zhou

    Abstract: Although modern automatic speech recognition (ASR) systems can achieve high performance, they may produce errors that weaken readers' experience and do harm to downstream tasks. To improve the accuracy and reliability of ASR hypotheses, we propose a cross-modal post-processing system for speech recognizers, which 1) fuses acoustic features and textual features from different modalities, 2) joints…

    Submitted 10 January, 2022; originally announced January 2022.

    Comments: submit to ICASSP2022, 5 pages, 3 figures

  13. arXiv:2109.07368  [pdf, other]

    cs.CL eess.AS

    Learning When to Translate for Streaming Speech

    Authors: Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li

    Abstract: How to find proper moments to generate partial sentence translation given a streaming speech input? Existing approaches waiting-and-translating for a fixed duration often break the acoustic units in speech, since the boundaries between acoustic units in speech are not even. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a usually long…

    Submitted 22 March, 2022; v1 submitted 15 September, 2021; originally announced September 2021.

    Comments: Accept to ACL 2022 main conference. 15 pages, 6 figures

  14. arXiv:2107.06151  [pdf, other]

    eess.SY

    Adaptive dynamic programming-based adaptive-gain sliding mode tracking control for fixed-wing UAV with disturbances

    Authors: Chaofan Zhang, Guoshan Zhang, Qi Dong

    Abstract: This paper proposes an adaptive dynamic programming-based adaptive-gain sliding mode control (ADP-ASMC) scheme for a fixed-wing unmanned aerial vehicle (UAV) with matched and unmatched disturbances. Starting from the dynamic of fixed-wing UAV, the control-oriented model composed of attitude subsystem and airspeed subsystem is established. According to the different issues in two subsystems, two no…

    Submitted 13 July, 2021; originally announced July 2021.

  15. arXiv:2105.07319  [pdf, other]

    cs.CL cs.SD eess.AS

    The Volctrans Neural Speech Translation System for IWSLT 2021

    Authors: Chengqi Zhao, Zhicheng Liu, Jian Tong, Tao Wang, Mingxuan Wang, Rong Ye, Qianqian Dong, Jun Cao, Lei Li

    Abstract: This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves 8.1 BLEU improvements over the benchmark on the MuST-C test set and is even approaching the results of a strong cascade solution. For text-to-text simulta…

    Submitted 30 June, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

    Comments: IWSLT 2021

  16. arXiv:2102.10503  [pdf, ps, other]

    eess.IV cs.CV

    Predicting Future Cognitive Decline with Hyperbolic Stochastic Coding

    Authors: J. Zhang, Q. Dong, J. Shi, Q. Li, C. M. Stonnington, B. A. Gutman, K. Chen, E. M. Reiman, R. J. Caselli, P. M. Thompson, J. Ye, Y. Wang

    Abstract: Hyperbolic geometry has been successfully applied in modeling brain cortical and subcortical surfaces with general topological structures. However such approaches, similar to other surface based brain morphology analysis methods, usually generate high dimensional features. It limits their statistical power in cognitive decline prediction research, especially in datasets with limited subject number…

    Submitted 20 February, 2021; originally announced February 2021.

  17. arXiv:2012.10018  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    NeurST: Neural Speech Translation Toolkit

    Authors: Chengqi Zhao, Mingxuan Wang, Qianqian Dong, Rong Ye, Lei Li

    Abstract: NeurST is an open-source toolkit for neural speech translation. The toolkit mainly focuses on end-to-end speech translation, which is easy to use, modify, and extend to advanced speech translation research and products. NeurST aims at facilitating the speech translation research for NLP researchers and building reliable benchmarks for this field. It provides step-by-step recipes for feature extrac…

    Submitted 15 June, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: Accepted by ACL 2021 (system demonstration)

  18. arXiv:2009.09737  [pdf, other]

    cs.CL eess.AS

    Consecutive Decoding for Speech-to-text Translation

    Authors: Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

    Abstract: Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Transl…

    Submitted 14 April, 2022; v1 submitted 21 September, 2020; originally announced September 2020.

    Comments: Accepted by AAAI 2021, 11 pages, 3 figures, 13 tables

  19. arXiv:2009.09704  [pdf, other]

    cs.CL eess.AS

    "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation

    Authors: Qianqian Dong, Rong Ye, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li

    Abstract: An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs the text in a target language. Existing methods are limited by the amount of parallel corpus. Can we build a system to fully utilize signals in a parallel ST corpus? We are inspired by human understanding system which is composed of auditory perception and cognitive processing. In this paper, we propose List…

    Submitted 5 April, 2021; v1 submitted 21 September, 2020; originally announced September 2020.

    Comments: Accepted by AAAI 2021

  20. Automatic Ischemic Stroke Lesion Segmentation from Computed Tomography Perfusion Images by Image Synthesis and Attention-Based Deep Neural Networks

    Authors: Guotai Wang, Tao Song, Qiang Dong, Mei Cui, Ning Huang, Shaoting Zhang

    Abstract: Ischemic stroke lesion segmentation from Computed Tomography Perfusion (CTP) images is important for accurate diagnosis of stroke in acute care units. However, it is challenged by low image contrast and resolution of the perfusion parameter maps, in addition to the complex appearance of the lesion. To deal with this problem, we propose a novel framework based on synthesized pseudo Diffusion-Weight…

    Submitted 7 July, 2020; originally announced July 2020.

    Comments: 14 pages, 10 figures

  21. arXiv:1911.09433  [pdf, ps, other]

    cs.IT eess.SP

    Time Varying Channel Tracking for Multi-UAV Wideband Communications with Beam Squint

    Authors: Jianwei Zhao, Qi Dong, Yanjie Zhao, Bolei Wang, Feifei Gao

    Abstract: Unmanned aerial vehicle (UAV) has become an appealing solution for a wide range of commercial and civilian applications because of its high mobility and flexible deployment. Due to the continuous UAV navigation, the channel between UAV and base station (BS) is subject to the Doppler effect. Meanwhile, when the BS is equipped with massive number of antennas, the non-negligible propagation delay acr…

    Submitted 21 November, 2019; originally announced November 2019.

  22. arXiv:1910.09768  [pdf, other]

    cs.LG cs.CV cs.NE eess.IV

    Face representation by deep learning: a linear encoding in a parameter space?

    Authors: Qiulei Dong, Jiayin Sun, Zhanyi Hu

    Abstract: Recently, Convolutional Neural Networks (CNNs) have achieved tremendous performances on face recognition, and one popular perspective regarding CNNs' success is that CNNs could learn discriminative face representations from face images with complex image feature encoding. However, it is still unclear what is the intrinsic mechanism of face representation in CNNs. In this work, we investigate this…

    Submitted 22 October, 2019; originally announced October 2019.

  23. arXiv:1602.02045  [pdf]

    math.OC eess.SY

    Fuzzy Logic Control of a Hybrid Energy Storage Module for Naval Pulsed Power Applications

    Authors: Isaac J. Cohen, David A. Wetz, Stepfanie Veiga, Qing Dong, John Heinzel

    Abstract: There is need for an energy storage device capable of transferring high power in transient situations aboard naval vessels. Currently, batteries are used to accomplish this task, but previous research has shown that when utilized at high power rates, these devices deteriorate over time causing a loss in lifespan. It has been shown that a hybrid energy storage configuration is capable of meeting su…

    Submitted 5 February, 2016; originally announced February 2016.

    Journal ref: International Journal of Fuzzy Logic Systems, vol. 6, no. 1, 2016