Showing 1–50 of 60 results for author: Guo, P

Searching in archive eess.
  1. arXiv:2405.19366  [pdf, other]

    eess.SP cs.AI

    ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text

    Authors: Han Yu, Peikun Guo, Akane Sano

    Abstract: The utilization of deep learning on electrocardiogram (ECG) analysis has brought the advanced accuracy and efficiency of cardiac healthcare diagnostics. By leveraging the capabilities of deep learning in semantic understanding, especially in feature extraction and representation learning, this study introduces a new multimodal contrastive pretraining framework that aims to improve the quality and r…

    Submitted 26 May, 2024; originally announced May 2024.

  2. arXiv:2405.10786  [pdf, other]

    eess.AS

    Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix

    Authors: Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie

    Abstract: Speaker anonymization is an effective privacy protection solution that aims to conceal the speaker's identity while preserving the naturalness and distinctiveness of the original speech. Mainstream approaches use an utterance-level vector from a pre-trained automatic speaker verification (ASV) model to represent speaker identity, which is then averaged or modified for anonymization. However, these…

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  3. arXiv:2405.02132  [pdf, other]

    cs.SD cs.CL eess.AS

    Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

    Authors: Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

    Abstract: Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configu…

    Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  4. arXiv:2405.00259   

    physics.med-ph eess.IV

    Optimization of Dark-Field CT for Lung Imaging

    Authors: Peiyuan Guo, Simon Spindler, Li Zhang, Zhentian Wang

    Abstract: Background: X-ray grating-based dark-field imaging can sense the small angle scattering caused by an object's micro-structure. This technique is sensitive to lung's porous alveoli and is able to detect lung disease at an early stage. Up to now, a human-scale dark-field CT has been built for lung imaging. Purpose: This study aimed to develop a more thorough optimization method for dark-field lung C…

    Submitted 1 May, 2024; v1 submitted 30 April, 2024; originally announced May 2024.

    Comments: There is a mistake in subsection 2.3: the content is incorrect because of a parameter we set incorrectly, so the calculations in the following sections are potentially incorrect as well

  5. arXiv:2401.06788  [pdf, other]

    eess.AS cs.AI cs.SD

    The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

    Authors: He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

    Abstract: This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of Single-Speaker VSR Task, and the open track of Multi-Speaker VSR Task. In terms of data processing, we leverage the lip motion extractor from the baseline1 to produce…

    Submitted 29 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Included in CNVSRC Workshop 2023, NCMMSC 2023

  6. arXiv:2401.04148  [pdf, other]

    cs.LG cs.AI eess.SP

    Online Test-Time Adaptation of Spatial-Temporal Traffic Flow Forecasting

    Authors: Pengxin Guo, Pengrong Jin, Ziyue Li, Lei Bai, Yu Zhang

    Abstract: Accurate spatial-temporal traffic flow forecasting is crucial in aiding traffic managers in implementing control measures and assisting drivers in selecting optimal travel routes. Traditional deep-learning based methods for traffic flow forecasting typically rely on historical data to train their models, which are then used to make predictions on future data. However, the performance of the traine…

    Submitted 8 January, 2024; originally announced January 2024.

  7. arXiv:2401.03697  [pdf, other]

    cs.SD eess.AS

    An audio-quality-based multi-strategy approach for target speaker extraction in the MISP 2023 Challenge

    Authors: Runduo Han, Xiaopeng Yan, Weiming Xu, Pengcheng Guo, Jiayao Sun, He Wang, Quan Lu, Ning Jiang, Lei Xie

    Abstract: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. Specifically, our approach adopts different extraction strategies based on the audio quality, striking a balance between interference removal and speech preservation, which benefits the back-en…

    Submitted 6 March, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  8. arXiv:2401.03473  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge

    Authors: He Wang, Pengcheng Guo, Yue Li, Ao Zhang, Jiayao Sun, Lei Xie, Wei Chen, Pan Zhou, Hui Bu, Xin Xu, Binbin Zhang, Zhuo Chen, Jian Wu, Longbiao Wang, Eng Siong Chng, Sun Li

    Abstract: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. This challenge collects over 100 hours of multi-channel speech data recorded inside a new energy vehicle and 40 hours…

    Submitted 20 February, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  9. MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

    Authors: He Wang, Pengcheng Guo, Pan Zhou, Lei Xie

    Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems aim to complement the audio stream with noise-invariant visual cues and improve the system's robustness. However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the…

    Submitted 8 April, 2024; v1 submitted 7 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures. Accepted at ICASSP 2024

  10. arXiv:2312.09746  [pdf, other]

    cs.SD eess.AS

    Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies

    Authors: Bingshen Mu, Pengcheng Guo, Dake Guo, Pan Zhou, Wei Chen, Lei Xie

    Abstract: Automatic Speech Recognition (ASR) has shown remarkable progress, yet it still faces challenges in real-world distant scenarios across various array topologies each with multiple recording devices. The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition perf…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  11. Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition

    Authors: Qijie Shao, Pengcheng Guo, Jinghao Yan, Pengfei Hu, Lei Xie

    Abstract: Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related a…

    Submitted 17 November, 2023; v1 submitted 12 November, 2023; originally announced November 2023.

    Comments: Accepted by IEEE Transactions on Audio, Speech and Language Processing (TASLP)

  12. arXiv:2310.04863  [pdf, other]

    cs.SD eess.AS

    SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

    Authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

    Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although being able to obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder which generates tokens one-by-one and results in a large real-time factor (RTF). To speed up inference, we intro…

    Submitted 7 October, 2023; originally announced October 2023.

  13. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  14. arXiv:2309.00929  [pdf, other]

    cs.SD eess.AS

    Timbre-reserved Adversarial Attack in Speaker Identification

    Authors: Qing Wang, Jixun Yao, Li Zhang, Pengcheng Guo, Lei Xie

    Abstract: As a type of biometric identification, a speaker identification (SID) system is confronted with various kinds of attacks. The spoofing attacks typically imitate the timbre of the target speakers, while the adversarial attacks confuse the SID system by adding a well-designed adversarial perturbation to an arbitrary speech. Although the spoofing attack copies a similar timbre as the victim, it does…

    Submitted 2 September, 2023; originally announced September 2023.

    Comments: 11 pages, 8 figures

  15. arXiv:2306.00804  [pdf, other]

    cs.SD cs.CL eess.AS

    Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

    Authors: Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie

    Abstract: By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual bias…

    Submitted 15 August, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

  16. arXiv:2305.19020  [pdf, other]

    cs.SD eess.AS

    Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification

    Authors: Qing Wang, Jixun Yao, Ziqian Wang, Pengcheng Guo, Lei Xie

    Abstract: In this study, we propose a timbre-reserved adversarial attack approach for speaker identification (SID) to not only exploit the weakness of the SID model but also preserve the timbre of the target speaker in a black-box attack setting. Particularly, we generate timbre-reserved fake audio by adding an adversarial constraint during the training of the voice conversion model. Then, we leverage a pse…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 5 pages

  17. arXiv:2305.13716  [pdf, other]

    cs.SD cs.CL eess.AS

    BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

    Authors: Yuhao Liang, Fan Yu, Yangze Li, Pengcheng Guo, Shiliang Zhang, Qian Chen, Lei Xie

    Abstract: The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the d…

    Submitted 5 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  18. TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition

    Authors: Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie, Jie Liu

    Abstract: UniSpeech has achieved superior performance in cross-lingual automatic speech recognition (ASR) by explicitly aligning latent representations to phoneme units using multi-task self-supervised learning. While the learned representations transfer well from high-resource to low-resource languages, predicting words directly from these phonetic representations in downstream ASR is challenging. In this…

    Submitted 8 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures. Accepted by INTERSPEECH 2023

  19. arXiv:2305.12493  [pdf, other]

    eess.AS cs.CL cs.SD

    Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network

    Authors: Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, Lei Xie

    Abstract: Contextual information plays a crucial role in speech recognition technologies and incorporating it into the end-to-end speech recognition models has drawn immense interest recently. However, previous deep bias methods lacked explicit supervision for bias tasks. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network predicts context…

    Submitted 12 July, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023

  20. arXiv:2303.06341  [pdf, other]

    eess.AS

    The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge

    Authors: Pengcheng Guo, He Wang, Bingshen Mu, Ao Zhang, Peikun Chen

    Abstract: This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. Specifically, the weighted prediction error (WPE) and guided source separation (GSS) techniques are used to reduce reverberation and generate clean signals for each single speaker first. Then, we explore the effectivenes…

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: 2 pages, accepted by ICASSP 2023

  21. arXiv:2302.13523  [pdf, other]

    cs.SD eess.AS

    VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

    Authors: Ao Zhang, He Wang, Pengcheng Guo, Yihui Fu, Lei Xie, Yingying Gao, Shilei Zhang, Junlan Feng

    Abstract: The performance of the keyword spotting (KWS) system based on audio modality, commonly measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships over multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the e…

    Submitted 14 March, 2023; v1 submitted 27 February, 2023; originally announced February 2023.

    Comments: 5 pages. Accepted at ICASSP 2023

  22. arXiv:2211.13443  [pdf, other]

    cs.SD eess.AS

    TESSP: Text-Enhanced Self-Supervised Speech Pre-training

    Authors: Zhuoyuan Yao, Shuo Ren, Sanyuan Chen, Ziyang Ma, Pengcheng Guo, Lei Xie

    Abstract: Self-supervised speech pre-training empowers the model with the contextual structure inherent in the speech signal while self-supervised text pre-training empowers the model with linguistic information. Both of them are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize the speech and text representation in the…

    Submitted 24 November, 2022; originally announced November 2022.

    Comments: 9 pages, 4 figures

  23. arXiv:2211.03038  [pdf, other]

    eess.AS cs.CR cs.SD

    Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling

    Authors: Jixun Yao, Qing Wang, Yi Lei, Pengcheng Guo, Lei Xie, Namin Wang, Jie Liu

    Abstract: Speech data on the Internet are proliferating exponentially because of the emergence of social media, and the sharing of such personal data raises obvious security and privacy concerns. One solution to mitigate these concerns involves concealing speaker identities before sharing speech data, also referred to as speaker anonymization. In our previous work, we have developed an automatic speaker ver…

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  24. arXiv:2211.03036  [pdf, other]

    eess.AS cs.SD

    Preserving background sound in noise-robust voice conversion via multi-task learning

    Authors: Jixun Yao, Yi Lei, Qing Wang, Pengcheng Guo, Ziqian Ning, Lei Xie, Hai Li, Junhui Liu, Danming Xie

    Abstract: Background sound is an informative form of art that is helpful in providing a more immersive experience in real-application voice conversion (VC) scenarios. However, prior research about VC, mainly focusing on clean voices, pay rare attention to VC with background sound. The critical problem for preserving background sound in VC is inevitable speech distortion by the neural separation model and th…

    Submitted 6 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  25. arXiv:2210.05265  [pdf, other]

    cs.SD eess.AS

    MFCCA:Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, Yuxiao Lin, Lei Xie

    Abstract: Recently cross-channel attention, which better leverages multi-channel signals from microphone array, has shown promising results in the multi-party meeting scenario. Cross-channel attention focuses on either learning global correlations between sequences of different channels or exploiting fine-grained channel-wise information effectively at each time step. Considering the delay of microphone arr…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT 2022

  26. arXiv:2209.11969  [pdf, other]

    eess.AS cs.SD

    NWPU-ASLP System for the VoicePrivacy 2022 Challenge

    Authors: Jixun Yao, Qing Wang, Li Zhang, Pengcheng Guo, Yuhao Liang, Lei Xie

    Abstract: This paper presents the NWPU-ASLP speaker anonymization system for VoicePrivacy 2022 Challenge. Our submission does not involve additional Automatic Speaker Verification (ASV) model or x-vector pool. Our system consists of four modules, including feature extractor, acoustic model, anonymization module, and neural vocoder. First, the feature extractor extracts the Phonetic Posteriorgram (PPG) and p…

    Submitted 24 September, 2022; originally announced September 2022.

    Comments: VoicePrivacy 2022 Challenge

  27. arXiv:2207.00883  [pdf, other]

    cs.SD cs.CL eess.AS

    Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

    Authors: Kun Wei, Pengcheng Guo, Ning Jiang

    Abstract: Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect con…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Accepted by Interspeech 2022

  28. arXiv:2206.06065  [pdf]

    eess.IV cs.CV

    Deep ensemble learning for segmenting tuberculosis-consistent manifestations in chest radiographs

    Authors: Sivaramakrishnan Rajaraman, Feng Yang, Ghada Zamzmi, Peng Guo, Zhiyun Xue, Sameer K Antani

    Abstract: Automated segmentation of tuberculosis (TB)-consistent lesions in chest X-rays (CXRs) using deep learning (DL) methods can help reduce radiologist effort, supplement clinical decision-making, and potentially result in improved patient treatment. The majority of works in the literature discuss training automatic segmentation models using coarse bounding box annotations. However, the granularity of…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 13 pages, 6 figures

    MSC Class: 68T07

  29. arXiv:2204.11669  [pdf]

    eess.IV cs.AI physics.med-ph

    Deep-learning-enabled Brain Hemodynamic Mapping Using Resting-state fMRI

    Authors: Xirui Hou, Pengfei Guo, Puyang Wang, Peiying Liu, Doris D. M. Lin, Hongli Fan, Yang Li, Zhiliang Wei, Zixuan Lin, Dengrong Jiang, Jin Jin, Catherine Kelly, Jay J. Pillai, Judy Huang, Marco C. Pinho, Binu P. Thomas, Babu G. Welch, Denise C. Park, Vishal M. Patel, Argye E. Hillis, Hanzhang Lu

    Abstract: Cerebrovascular disease is a leading cause of death globally. Prevention and early intervention are known to be the most effective forms of its management. Non-invasive imaging methods hold great promises for early stratification, but at present lack the sensitivity for personalized prognosis. Resting-state functional magnetic resonance imaging (rs-fMRI), a powerful tool previously used for mappin…

    Submitted 25 April, 2022; originally announced April 2022.

    Journal ref: npj Digital Medicine (2023) 116

  30. arXiv:2204.03398  [pdf, other]

    cs.SD eess.AS

    Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

    Authors: Qijie Shao, Jinghao Yan, Jian Kang, Pengcheng Guo, Xian Shi, Pengfei Hu, Lei Xie

    Abstract: General accent recognition (AR) models tend to directly extract low-level information from spectrums, which always significantly overfit on speakers or channels. Considering accent can be regarded as a series of shifts relative to native pronunciation, distinguishing accents will be an easier task with accent shift as input. But due to the lack of native utterance as an anchor, estimating the acce…

    Submitted 1 July, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted by Interspeech 2022

  31. arXiv:2203.06338  [pdf, other]

    eess.IV cs.CV

    Auto-FedRL: Federated Hyperparameter Optimization for Multi-institutional Medical Image Segmentation

    Authors: Pengfei Guo, Dong Yang, Ali Hatamizadeh, An Xu, Ziyue Xu, Wenqi Li, Can Zhao, Daguang Xu, Stephanie Harmon, Evrim Turkbey, Baris Turkbey, Bradford Wood, Francesca Patella, Elvira Stellato, Gianpaolo Carrafiello, Vishal M. Patel, Holger R. Roth

    Abstract: Federated learning (FL) is a distributed machine learning technique that enables collaborative model training while avoiding explicit data sharing. The inherent privacy-preserving property of FL algorithms makes them especially attractive to the medical field. However, in case of heterogeneous client data distributions, standard FL methods are unstable and require intensive hyperparameter tuning t…

    Submitted 31 August, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

  32. arXiv:2203.05574  [pdf, other]

    eess.IV cs.CV

    On-the-Fly Test-time Adaptation for Medical Image Segmentation

    Authors: Jeya Maria Jose Valanarasu, Pengfei Guo, Vibashan VS, Vishal M. Patel

    Abstract: One major problem in deep learning-based solutions for medical imaging is the drop in performance when a model is tested on a data distribution different from the one that it is trained on. Adapting the source model to target data distribution at test-time is an efficient solution for the data-shift problem. Previous methods solve this by adapting the model to target distribution by using techniqu…

    Submitted 10 March, 2022; originally announced March 2022.

    Comments: Tech Report

  33. arXiv:2203.04292  [pdf, other]

    eess.IV cs.CV cs.LG

    Towards performant and reliable undersampled MR reconstruction via diffusion model sampling

    Authors: Cheng Peng, Pengfei Guo, S. Kevin Zhou, Vishal Patel, Rama Chellappa

    Abstract: Magnetic Resonance (MR) image reconstruction from under-sampled acquisition promises faster scanning time. To this end, current State-of-The-Art (SoTA) approaches leverage deep neural networks and supervised training to learn a recovery model. While these approaches achieve impressive performances, the learned model can be fragile on unseen degradation, e.g. when given a different acceleration fac…

    Submitted 10 March, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

  34. arXiv:2202.03647  [pdf, other]

    cs.SD eess.AS

    Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

    Authors: Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies. The M2MeT challenge has particularly set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Ma…

    Submitted 25 February, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP 2022

  35. arXiv:2201.09376  [pdf, other]

    eess.IV cs.CV

    ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer

    Authors: Pengfei Guo, Yiqun Mei, Jinyuan Zhou, Shanshan Jiang, Vishal M. Patel

    Abstract: Accelerating magnetic resonance image (MRI) reconstruction process is a challenging ill-posed inverse problem due to the excessive under-sampling operation in k-space. In this paper, we propose a recurrent transformer model, namely ReconFormer, for MRI reconstruction which can iteratively reconstruct high-fidelity magnetic resonance images from highly under-sampled k-space data. In particular, th…

    Submitted 27 January, 2022; v1 submitted 23 January, 2022; originally announced January 2022.

  36. arXiv:2111.06707  [pdf, other]

    eess.IV cs.CV

    Transformer-based Image Compression

    Authors: Ming Lu, Peiyao Guo, Huiqing Shi, Chuntong Cao, Zhan Ma

    Abstract: A Transformer-based Image Compression (TIC) approach is developed which reuses the canonical variational autoencoder (VAE) architecture with paired main and hyper encoder-decoders. Both main and hyper encoders are comprised of a sequence of neural transformation units (NTUs) to analyse and aggregate important information for more compact representation of input image, while the decoders mirror the…

    Submitted 12 November, 2021; originally announced November 2021.

  37. arXiv:2110.13720  [pdf]

    eess.IV cond-mat.mtrl-sci cs.CV

    Deep DIC: Deep Learning-Based Digital Image Correlation for End-to-End Displacement and Strain Measurement

    Authors: Ru Yang, Yang Li, Danielle Zeng, Ping Guo

    Abstract: Digital image correlation (DIC) has become an industry standard to retrieve accurate displacement and strain measurement in tensile testing and other material characterization. Though traditional DIC offers a high precision estimation of deformation for general tensile testing cases, the prediction becomes unstable at large deformation or when the speckle patterns start to tear. In addition, tradi…

    Submitted 6 January, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 39 pages, 19 figures

    Journal ref: Journal of Materials Processing Technology (2021): 117474

  38. arXiv:2110.07393  [pdf, other]

    cs.SD eess.AS

    M2MeT: The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge

    Authors: Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, Weilong Huang, Pengcheng Guo, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

    Abstract: Recent development of speech processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Specifically, two typical tasks, speaker diarization and multi-speaker automatic speech recognition hav…

    Submitted 25 February, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  39. arXiv:2110.04590  [pdf, other]

    cs.CL cs.SD eess.AS

    An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

    Authors: Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe

    Abstract: Self-supervised pretraining on speech data has achieved a lot of progress. High-fidelity representation of the speech signal is learned from a lot of untranscribed data and shows promising performance. Recently, there are several works focusing on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g. SUPERB. However, such evaluations…

    Submitted 9 October, 2021; originally announced October 2021.

    Comments: To appear in ASRU 2021

  40. arXiv:2107.00636  [pdf, other]

    eess.AS cs.CL cs.SD

    ESPnet-ST IWSLT 2021 Offline Speech Translation System

    Authors: Hirofumi Inaguma, Brian Yan, Siddharth Dalmia, Pengcheng Guo, Jiatong Shi, Kevin Duh, Shinji Watanabe

    Abstract: This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on diff…

    Submitted 6 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: IWSLT 2021

  41. arXiv:2106.08886  [pdf, other]

    eess.IV cs.CV

    Over-and-Under Complete Convolutional RNN for MRI Reconstruction

    Authors: Pengfei Guo, Jeya Maria Jose Valanarasu, Puyang Wang, Jinyuan Zhou, Shanshan Jiang, Vishal M. Patel

    Abstract: Reconstructing magnetic resonance (MR) images from undersampled data is a challenging problem due to various artifacts introduced by the under-sampling operation. Recent deep learning-based methods for MR image reconstruction usually leverage a generic auto-encoder architecture which captures low-level features at the initial layers and high-level features at the deeper layers. Such networks focus…

    Submitted 24 June, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to MICCAI 2021

  42. arXiv:2106.08595  [pdf, other

    eess.AS cs.SD

    Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

    Authors: Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie

    Abstract: Non-autoregressive (NAR) models have achieved a large inference computation reduction and comparable results with autoregressive (AR) models on various sequence to sequence tasks. However, there has been limited research aiming to explore the NAR approaches on sequence to multi-sequence problems, like multi-speaker automatic speech recognition (ASR). In this study, we extend our proposed condition…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted by Interspeech 2021

  43. arXiv:2104.04702  [pdf, other

    cs.SD eess.AS

    Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR

    Authors: Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, Shilei Zhang

    Abstract: Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been well applied in non-autoregressive (NAR) speech recognition with competitive performance compared with other NAR methods. However, such an alignment learning strategy may suffer from an erroneous acoustic boundary estimation, severely hindering the convergence speed as well as the system…

    Submitted 26 September, 2021; v1 submitted 10 April, 2021; originally announced April 2021.

    Comments: 5 pages, 4 figures

  44. arXiv:2103.02148  [pdf, other

    eess.IV cs.CV

    Multi-institutional Collaborations for Improving Deep Learning-based Magnetic Resonance Image Reconstruction Using Federated Learning

    Authors: Pengfei Guo, Puyang Wang, Jinyuan Zhou, Shanshan Jiang, Vishal M. Patel

    Abstract: Fast and accurate reconstruction of magnetic resonance (MR) images from under-sampled data is important in many clinical applications. In recent years, deep learning-based methods have been shown to produce superior performance on MR image reconstruction. However, these methods require large amounts of data which is difficult to collect and share due to the high cost of acquisition and medical dat…

    Submitted 10 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

    Comments: Accepted at CVPR 2021

  45. arXiv:2012.13006  [pdf, other

    eess.AS cs.SD

    The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

    Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

    Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text…

    Submitted 23 December, 2020; originally announced December 2020.

  46. arXiv:2011.09301  [pdf, other

    cs.SD eess.AS

    Context-aware RNNLM Rescoring for Conversational Speech Recognition

    Authors: Kun Wei, Pengcheng Guo, Hang Lv, Zhen Tu, Lei Xie

    Abstract: Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored the modeling of long-range context through RNNLM rescoring with improved performance. To further take advantage of the persisted nature during a conversation, such as topics or speaker turn, we extend the rescoring procedure to a new cont…

    Submitted 18 November, 2020; originally announced November 2020.

  47. arXiv:2011.08623  [pdf, other

    cs.SD eess.AS

    Adversarial Training for Multi-domain Speaker Recognition

    Authors: Qing Wang, Wei Rao, Pengcheng Guo, Lei Xie

    Abstract: In real-life applications, the performance of speaker recognition systems always degrades when there is a mismatch between training and evaluation data. Many domain adaptation methods have been successfully used for eliminating the domain mismatches in speaker recognition. However, usually both training and evaluation data themselves can be composed of several subsets. These inner variances of eac…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 5 pages, 2 figures

  48. arXiv:2010.13956  [pdf, other

    eess.AS cs.SD

    Recent Developments on ESPnet Toolkit Boosted by Conformer

    Authors: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

    Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-…

    Submitted 29 October, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

  49. arXiv:2009.13304  [pdf

    cs.CV cs.MM eess.IV

    Cuid: A new study of perceived image quality and its subjective assessment

    Authors: Lucie Lévêque, Ji Yang, Xiaohan Yang, Pengfei Guo, Kenneth Dasalla, Leida Li, Yingying Wu, Hantao Liu

    Abstract: Research on image quality assessment (IQA) remains limited mainly due to our incomplete knowledge about human visual perception. Existing IQA algorithms have been designed or trained with insufficient subjective data with a small degree of stimulus variability. This has led to challenges for those algorithms to handle complexity and diversity of real-world digital content. Perceptual evidence from…

    Submitted 28 September, 2020; originally announced September 2020.

    Journal ref: 27th IEEE International Conference on Image Processing (ICIP), Oct 2020, Abu Dhabi, United Arab Emirates

  50. arXiv:2008.02859  [pdf, other

    eess.IV cs.CV

    Confidence-guided Lesion Mask-based Simultaneous Synthesis of Anatomic and Molecular MR Images in Patients with Post-treatment Malignant Gliomas

    Authors: Pengfei Guo, Puyang Wang, Rajeev Yasarla, Jinyuan Zhou, Vishal M. Patel, Shanshan Jiang

    Abstract: Data-driven automatic approaches have demonstrated their great potential in resolving various clinical diagnostic dilemmas in neuro-oncology, especially with the help of standard anatomic and advanced molecular MR images. However, data quantity and quality remain a key determinant of, and a significant limit on, the potential of such applications. In our previous work, we explored synthesis of ana…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: Submit to IEEE TMI. arXiv admin note: text overlap with arXiv:2006.14761