Showing 1–16 of 16 results for author: Fang, Q

Searching in archive eess.
  1. arXiv:2406.07330  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    CTC-based Non-autoregressive Textless Speech-to-Speech Translation

    Authors: Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng

    Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investig…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Findings

    ACM Class: I.2.7

  2. arXiv:2406.07289  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

    Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

    Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ACL 2024 main conference. Project Page: https://ictnlp.github.io/ComSpeech-Site/

    ACM Class: I.2.7

  3. arXiv:2406.06937  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

    Authors: Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang

    Abstract: Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization betwee…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: ACL 2024; Codes and demos are at https://github.com/ictnlp/NAST-S2x

  4. arXiv:2406.03049  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

    Authors: Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min Zhang, Yang Feng

    Abstract: Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference, Project Page: https://ictnlp.github.io/StreamSpeech-site/

  5. arXiv:2310.07403  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation

    Authors: Qingkai Fang, Yan Zhou, Yang Feng

    Abstract: Direct speech-to-speech translation (S2ST) translates speech from one language into another using a single model. However, due to the presence of linguistic and acoustic diversity, the target speech follows a complex multimodal distribution, posing challenges to achieving both high-quality translations and fast decoding speeds for S2ST models. In this paper, we propose DASpeech, a non-autoregressi…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023. Audio samples are available at https://ictnlp.github.io/daspeech-demo/

    ACM Class: I.2.7

  6. arXiv:2306.05281

    eess.SP cs.AI cs.RO

    A Graph Reconstruction by Dynamic Signal Coefficient for Fault Classification

    Authors: Wenbin He, Jianxu Mao, Yaonan Wang, Zhe Li, Qiu Fang, Haotian Wu

    Abstract: To improve the performance in identifying the faults under strong noise for rotating machinery, this paper presents a dynamic feature reconstruction signal graph method, which plays the key role of the proposed end-to-end fault diagnosis model. Specifically, the original mechanical signal is first decomposed by wavelet packet decomposition (WPD) to obtain multiple subbands including coefficient ma…

    Submitted 29 September, 2023; v1 submitted 30 May, 2023; originally announced June 2023.

    Comments: The feature extraction algorithm DFSL has errors in derivation and experimental deficiencies

  7. arXiv:2305.14635  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation

    Authors: Yan Zhou, Qingkai Fang, Yang Feng

    Abstract: End-to-end speech translation (ST) is the task of translating speech signals in the source language into text in the target language. As a cross-modal task, end-to-end ST is difficult to train with limited data. Existing methods often try to transfer knowledge from machine translation (MT), but their performances are restricted by the modality gap between speech and text. In this paper, we propose…

    Submitted 25 May, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: ACL 2023 main conference

  8. arXiv:2305.08709  [pdf, other]

    cs.CL cs.SD eess.AS

    Back Translation for Speech-to-text Translation Without Transcripts

    Authors: Qingkai Fang, Yang Feng

    Abstract: The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are only sometimes available since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: ACL 2023 main conference

    ACM Class: I.2.7

  9. arXiv:2305.08706  [pdf, other]

    cs.CL cs.SD eess.AS

    Understanding and Bridging the Modality Gap for Speech Translation

    Authors: Qingkai Fang, Yang Feng

    Abstract: How to achieve better end-to-end speech translation (ST) by leveraging (text) machine translation (MT) data? Among various existing techniques, multi-task learning is one of the effective ways to share knowledge between ST and MT in which additional MT data can help to learn source-to-target mapping. However, due to the differences between speech and text, there is always a gap between ST and MT.…

    Submitted 15 May, 2023; originally announced May 2023.

    Comments: ACL 2023 main conference

    ACM Class: I.2.7

  10. arXiv:2302.13273  [pdf, other]

    cs.SD cs.MM eess.AS

    Two-Stream Joint-Training for Speaker Independent Acoustic-to-Articulatory Inversion

    Authors: Jianrong Wang, **yu Liu, Li Liu, Xuewei Li, Mei Yu, Jie Gao, Qiang Fang

    Abstract: Acoustic-to-articulatory inversion (AAI) aims to estimate the parameters of articulators from speech audio. There are two common challenges in AAI, which are the limited data and the unsatisfactory performance in speaker independent scenario. Most current works focus on extracting features directly from speech and ignoring the importance of phoneme information which may limit the performance of AA…

    Submitted 26 February, 2023; originally announced February 2023.

  11. arXiv:2302.12571  [pdf]

    eess.IV cs.CV physics.med-ph

    3D PETCT Tumor Lesion Segmentation via GCN Refinement

    Authors: Hengzhi Xue, Qingqing Fang, Yudong Yao, Yueyang Teng

    Abstract: Whole-body PET/CT scan is an important tool for diagnosing various malignancies (e.g., malignant melanoma, lymphoma, or lung cancer), and accurate segmentation of tumors is a key part for subsequent treatment. In recent years, CNN-based segmentation methods have been extensively investigated. However, these methods often give inaccurate segmentation results, such as over-segmentation and under-seg…

    Submitted 24 February, 2023; originally announced February 2023.

    Comments: 10 pages,5 figures,38 reference

  12. arXiv:2209.07302  [pdf, other]

    cs.SD eess.AS

    MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement

    Authors: Jianrong Wang, Xiaomin Li, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

    Abstract: Speech enhancement improves speech quality and promotes the performance of various downstream tasks. However, most current speech enhancement work was mainly devoted to improving the performance of downstream automatic speech recognition (ASR), only a relatively small amount of work focused on the automatic speaker verification (ASV) task. In this work, we propose a MVNet consisted of a memory ass…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: ICONIP 2022

  13. arXiv:2204.01672  [pdf, other]

    cs.SD cs.CV eess.AS

    Residual-guided Personalized Speech Synthesis based on Face Image

    Authors: Jianrong Wang, Zixuan Wang, Xiaosheng Hu, Xuewei Li, Qiang Fang, Li Liu

    Abstract: Previous works derive personalized speech features by training the model on a large dataset composed of his/her audio sounds. It was reported that face information has a strong link with the speech sound. Thus in this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using neural vocoder. A Face-based Residual Personalized Speech Synthesi…

    Submitted 1 April, 2022; originally announced April 2022.

    Comments: ICASSP 2022

  14. arXiv:2202.13764  [pdf, ps, other]

    eess.SP

    A Note on "Optimum Sets of Interference-Free Sequences With Zero Autocorrelation Zone"

    Authors: Qi** Fang, Zilong Wang

    Abstract: In this paper, a simple construction of interference-free zero correlation zone (IF-ZCZ) sequence sets is proposed by well designed finite Zak transform lattice tessellation. Each set is characterized by the period of sequences $KM$, the set size $K$ and the length of zero correlation zone $M-1$, which is optimal with respect to the Tang-Fan-Matsufuji bound. Secondly, the transformations that keep…

    Submitted 28 February, 2022; originally announced February 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:1912.09781

  15. arXiv:2112.02991  [pdf, other]

    cs.CV cs.AI eess.IV

    Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery

    Authors: Qingyun Fang, Zhaokui Wang

    Abstract: Cross-modality fusing complementary information of multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable for a wider range of applications, such as nighttime detection. Compared with prior methods, we think different features should be processed specifically, the modality-specific features should be retained and en…

    Submitted 6 December, 2021; originally announced December 2021.

    Comments: 23 pages,11 figures, under consideration at Pattern Recognition

  16. arXiv:2106.13686  [pdf, other]

    cs.MM cs.SD eess.AS eess.IV

    Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition

    Authors: Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu

    Abstract: Cued Speech (CS) is a visual communication system for the deaf or hearing impaired people. It combines lip movements with hand cues to obtain a complete phonetic repertoire. Current deep learning based methods on automatic CS recognition suffer from a common problem, which is the data scarcity. Until now, there are only two public single speaker datasets for French (238 sentences) and British Engl…

    Submitted 25 June, 2021; originally announced June 2021.