Showing 1–50 of 129 results for author: Gu, Y

Searching in archive eess.
  1. arXiv:2407.01494  [pdf, other]

    cs.CV cs.SD eess.AS

    FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

    Authors: Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, Kai Chen

    Abstract: We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantic relevant and temporal synchronized) sounds. To overcome these limitations…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Project page: https://foleycrafter.github.io/

  2. arXiv:2406.17244  [pdf, other]

    eess.SP eess.SY

    A Near-Field Super-Resolution Network for Accelerating Antenna Characterization

    Authors: Yuchen Gu, Hai-Han Sun, Daniel W. van der Weide

    Abstract: We present a deep neural network-enabled method to accelerate near-field (NF) antenna measurement. We develop a Near-field Super-resolution Network (NFS-Net) to reconstruct significantly undersampled near-field data as high-resolution data, which considerably reduces the number of sampling points required for NF measurement and thus improves measurement efficiency. The high-resolution near-field d…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  3. arXiv:2406.09618  [pdf, other]

    cs.CL cs.AI cs.IR cs.SD eess.AS

    Multi-Modal Retrieval For Large Language Model Based Speech Recognition

    Authors: Jari Kolehmainen, Aditya Gourav, Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ariya Rastrow, Grant Strimel, Ivan Bulyko

    Abstract: Retrieval is a widely adopted approach for improving language models leveraging external information. As the field moves towards multi-modal large language models, it is important to extend the pure text based methods to incorporate other modalities in retrieval as well for applications across the wide spectrum of machine learning tasks and data types. In this work, we propose multi-modal retrieva…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2406.08081  [pdf]

    eess.SP

    CLDTA: Contrastive Learning based on Diagonal Transformer Autoencoder for Cross-Dataset EEG Emotion Recognition

    Authors: Yuan Liao, Yuhong Zhang, Shenghuan Wang, Xiruo Zhang, Yiling Zhang, Wei Chen, Yuzhe Gu, Liya Huang

    Abstract: Recent advances in non-invasive EEG technology have broadened its application in emotion recognition, yielding a multitude of related datasets. Yet, deep learning models struggle to generalize across these datasets due to variations in acquisition equipment and emotional stimulus materials. To address the pressing need for a universal model that fluidly accommodates diverse EEG dataset formats and…

    Submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2406.05325  [pdf, other]

    eess.AS cs.SD

    LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

    Authors: Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Li** Chen, Lirong Dai

    Abstract: Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusi…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  6. arXiv:2405.17100  [pdf, other]

    cs.CR cs.SD eess.AS

    Sok: Comprehensive Security Overview, Challenges, and Future Directions of Voice-Controlled Systems

    Authors: Haozhe Xu, Cong Wu, Yangyang Gu, Xingcan Shang, **g Chen, Kun He, Ruiying Du

    Abstract: The integration of Voice Control Systems (VCS) into smart devices and their growing presence in daily life accentuate the importance of their security. Current research has uncovered numerous vulnerabilities in VCS, presenting significant risks to user privacy and security. However, a cohesive and systematic examination of these vulnerabilities and the corresponding solutions is still absent. This…

    Submitted 27 May, 2024; originally announced May 2024.

  7. arXiv:2405.14336  [pdf, other]

    eess.IV

    I$^2$VC: A Unified Framework for Intra- & Inter-frame Video Compression

    Authors: Meiqin Liu, Chenming Xu, Yukai Gu, Chao Yao, Yao Zhao

    Abstract: Video compression aims to reconstruct seamless frames by encoding the motion and residual information from existing frames. Previous neural video compression methods necessitate distinct codecs for three types of frames (I-frame, P-frame and B-frame), which hinders a unified approach and generalization across different video contexts. Intra-codec techniques lack the advanced Motion Estimation and…

    Submitted 1 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 19 pages, 10 figures

  8. arXiv:2404.19441  [pdf, other]

    cs.SD eess.AS

    ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

    Authors: Yuzhe Gu, Enmao Diao

    Abstract: Neural speech codecs aim to compress input signals into minimal bits while maintaining content quality in a low-latency manner. However, existing codecs often trade computational complexity for reconstruction performance. These codecs primarily use convolutional blocks for feature transformation layers, which are not inherently suited for capturing the local redundancies in speech signals. To comp…

    Submitted 21 June, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

  9. arXiv:2404.17161  [pdf, other]

    cs.SD eess.AS eess.SP

    An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

    Authors: Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

    Abstract: Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in Short-Time Fourier Transform (STFT), which owns a constan…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.14957

  10. arXiv:2403.19374  [pdf, other]

    cs.ET eess.SY

    A noise-tolerant, resource-saving probabilistic binary neural network implemented by the SOT-MRAM compute-in-memory system

    Authors: Yu Gu, Puyang Huang, Tianhao Chen, Chenyi Fu, Aitian Chen, Shouzhong Peng, Xixiang Zhang, Xufeng Kou

    Abstract: We report a spin-orbit torque (SOT) magnetoresistive random-access memory (MRAM)-based probabilistic binary neural network (PBNN) for resource-saving and hardware noise-tolerant computing applications. With the presence of thermal fluctuation, the non-destructive SOT-driven magnetization switching characteristics lead to a random weight matrix with controllable probability distribution. In the meanwh…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 5 pages, 10 figures

    MSC Class: 94C60

    ACM Class: B.2.4; B.3.0

  11. arXiv:2403.16262  [pdf, other]

    cs.RO eess.SY

    HT-LIP Model based Robust Control of Quadrupedal Robot Locomotion under Unknown Vertical Ground Motion

    Authors: Amir Iqbal, Sushant Veer, Christopher Niezrecki, Yan Gu

    Abstract: This paper presents a hierarchical control framework that enables robust quadrupedal locomotion on a dynamic rigid surface (DRS) with general and unknown vertical motions. The key novelty of the framework lies in its higher layer, which is a discrete-time, provably stabilizing footstep controller. The basis of the footstep controller is a new hybrid, time-varying, linear inverted pendulum (HT-LIP)…

    Submitted 24 March, 2024; originally announced March 2024.

  12. arXiv:2403.16252  [pdf, other]

    cs.RO eess.SY

    Legged Robot State Estimation within Non-inertial Environments

    Authors: Zijian He, Sangli Teng, Tzu-Yuan Lin, Maani Ghaffari, Yan Gu

    Abstract: This paper investigates the robot state estimation problem within a non-inertial environment. The proposed state estimation approach relaxes the common assumption of static ground in the system modeling. The process and measurement models explicitly treat the movement of the non-inertial environments without requiring knowledge of its motion in the inertial frame or relying on GPS or sensing envir…

    Submitted 24 March, 2024; originally announced March 2024.

  13. arXiv:2403.11672  [pdf, other]

    eess.IV cs.CV

    WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising

    Authors: Haoyu Zhao, Yuliang Gu, Zhou Zhao, Bo Du, Yongchao Xu, Rui Yu

    Abstract: In clinical examinations and diagnoses, low-dose computed tomography (LDCT) is crucial for minimizing health risks compared with normal-dose computed tomography (NDCT). However, reducing the radiation dose compromises the signal-to-noise ratio, leading to degraded quality of CT images. To address this, we analyze LDCT denoising task based on experimental results from the frequency perspective, and…

    Submitted 1 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: MICCAI2024

  14. arXiv:2402.10816  [pdf, other]

    cs.LG cs.CR cs.DC eess.SP

    TernaryVote: Differentially Private, Communication Efficient, and Byzantine Resilient Distributed Optimization on Heterogeneous Data

    Authors: Richeng **, Yujie Gu, Kai Yue, Xiaofan He, Zhaoyang Zhang, Huaiyu Dai

    Abstract: Distributed training of deep neural networks faces three critical challenges: privacy preservation, communication efficiency, and robustness to fault and adversarial behaviors. Although significant research efforts have been devoted to addressing these challenges independently, their synthesis remains less explored. In this paper, we propose TernaryVote, which combines a ternary compressor and the…

    Submitted 16 February, 2024; originally announced February 2024.

  15. arXiv:2401.15508  [pdf, other]

    cs.RO cs.LG eess.SY

    Proto-MPC: An Encoder-Prototype-Decoder Approach for Quadrotor Control in Challenging Winds

    Authors: Yuliang Gu, Sheng Cheng, Naira Hovakimyan

    Abstract: Quadrotors are increasingly used in the evolving field of aerial robotics for their agility and mechanical simplicity. However, inherent uncertainties, such as aerodynamic effects coupled with quadrotors' operation in dynamically changing environments, pose significant challenges for traditional, nominal model-based control designs. We propose a multi-task meta-learning method called Encoder-Proto…

    Submitted 21 May, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

  16. arXiv:2401.05446  [pdf, other]

    eess.SP cs.AI cs.LG

    Self-supervised Learning for Electroencephalogram: A Systematic Survey

    Authors: Weining Weng, Yang Gu, Shuai Guo, Yuan Ma, Zhaohua Yang, Yuchen Liu, Yiqiang Chen

    Abstract: Electroencephalogram (EEG) is a non-invasive technique to record bioelectrical signals. Integrating supervised deep learning techniques with EEG signals has recently facilitated automatic analysis across diverse EEG-based tasks. However, the label issues of EEG signals have constrained the development of EEG-based deep models. Obtaining EEG annotations is difficult that requires domain experts to…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: 35 pages, 12 figures

    MSC Class: 68-02 (Primarily); 68T01 (Secondary)

    ACM Class: I.2; J.3; I.5.4

  17. arXiv:2401.03468  [pdf, other]

    eess.AS cs.SD

    Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

    Authors: Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

    Abstract: Self-supervised speech pre-training methods have developed rapidly in recent years, which show to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the scarcity of labeled multichannel data and complex ambient noises. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech proces…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: Accepted by AAAI 2024

  18. arXiv:2401.02921  [pdf, other]

    cs.CL eess.AS

    Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

    Authors: Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

    Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors ca…

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: Accepted to ICASSP 2024

  19. arXiv:2401.00159  [pdf, other]

    eess.IV cs.CV

    Automatic hip osteoarthritis grading with uncertainty estimation from computed tomography using digitally-reconstructed radiographs

    Authors: Masachika Masuda, Mazen Soufi, Yoshito Otake, Keisuke Uemura, Sotaro Kono, Kazuma Takashima, Hidetoshi Hamada, Yi Gu, Masaki Takao, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato

    Abstract: Progression of hip osteoarthritis (hip OA) leads to pain and disability, likely leading to surgical treatment such as hip arthroplasty at the terminal stage. The severity of hip OA is often classified using the Crowe and Kellgren-Lawrence (KL) classifications. However, as the classification is subjective, we aimed to develop an automated approach to classify the disease severity based on the two g…

    Submitted 30 December, 2023; originally announced January 2024.

  20. arXiv:2312.15316  [pdf, other]

    cs.CL eess.AS

    Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

    Authors: Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

    Abstract: Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore pro…

    Submitted 17 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024. Camera-ready version

  21. arXiv:2312.13752  [pdf]

    eess.IV cs.AI cs.CV

    Hunting imaging biomarkers in pulmonary fibrosis: Benchmarks of the AIIB23 challenge

    Authors: Yang Nan, Xiaodan Xing, Shiyi Wang, Zeyu Tang, Federico N Felder, Sheng Zhang, Roberta Eufrasia Ledda, Xiaoliu Ding, Ruiqi Yu, Wei** Liu, Feng Shi, Tianyang Sun, Zehong Cao, Minghui Zhang, Yun Gu, Hanxiao Zhang, Jian Gao, **yu Wang, Wen Tang, Pengxin Yu, Han Kang, Junqiang Chen, Xing Lu, Boyu Zhang, Michail Mamalakis, et al. (16 additional authors not shown)

    Abstract: Airway-related quantitative imaging biomarkers are crucial for examination, diagnosis, and prognosis in pulmonary diseases. However, the manual delineation of airway trees remains prohibitively time-consuming. While significant efforts have been made towards enhancing airway modelling, current public-available datasets concentrate on lung diseases with moderate morphological variations. The intric…

    Submitted 16 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 19 pages

  22. arXiv:2312.09911  [pdf, other]

    cs.SD eess.AS

    Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

    Authors: Xueyao Zhang, Liumeng Xue, Yicheng Gu, Yuancheng Wang, Haorui He, Chaoren Wang, Xi Chen, Zihao Fang, Haopeng Chen, Junan Zhang, Tze Ying Tang, Lexiao Zou, Mingxuan Wang, Jun Han, Kai Chen, Haizhou Li, Zhizheng Wu

    Abstract: Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to ease the way for junior researchers and engineers into these fields. It presents a unified framework that is inclusive of diverse generation tasks and models, with the added bonus of being easily extendable for new incorporation. The toolkit is designed with beginner-friendly workflows and pre-trained models, a…

    Submitted 22 February, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: Amphion Website: https://github.com/open-mmlab/Amphion

  23. arXiv:2312.09576  [pdf, other]

    eess.IV cs.CV

    SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

    Authors: Xiangde Luo, Jia Fu, Yunxin Zhong, Shuolin Liu, Bing Han, Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu, Yiwen Ye, Ziyang Chen, Yong Xia, Yanzhou Su, ** Ye, Junjun He, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Kaixiang Yang, Xin Fang, Zhiwei Wang, Chan Woong Lee, Sang Joon Park, Jaehee Chun, Constantin Ulrich, Klaus H. Maier-Hein, et al. (17 additional authors not shown)

    Abstract: Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: A challenge report of SegRap2023 (organized in conjunction with MICCAI2023)

  24. Synergistic Perception and Control Simplex for Verifiable Safe Vertical Landing

    Authors: Ayoosh Bansal, Yang Zhao, James Zhu, Sheng Cheng, Yuliang Gu, Hyung-** Yoon, Hunmin Kim, Naira Hovakimyan, Lui Sha

    Abstract: Perception, Planning, and Control form the essential components of autonomy in advanced air mobility. This work advances the holistic integration of these components to enhance the performance and robustness of the complete cyber-physical system. We adapt Perception Simplex, a system for verifiable collision avoidance amidst obstacle detection faults, to the vertical landing maneuver for autonomou…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: To appear in AIAA SciTech 2024

    ACM Class: C.3; C.4; J.7

    Journal ref: AIAA SCITECH 2024 Forum, p. 1167

  25. arXiv:2311.14957  [pdf, other]

    cs.SD eess.AS

    Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

    Authors: Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu

    Abstract: Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in Short-Time Fourier Transform (STFT), whose time-frequency res…

    Submitted 25 November, 2023; originally announced November 2023.

  26. arXiv:2311.12199  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Label Assignments Learning by Dynamic Sample Dropout Combined with Layer-wise Optimization in Speech Separation

    Authors: Chenyang Gao, Yue Gu, Ivan Marsic

    Abstract: In supervised speech separation, permutation invariant training (PIT) is widely used to handle label ambiguity by selecting the best permutation to update the model. Despite its success, previous studies showed that PIT is plagued by excessive label assignment switching in adjacent epochs, impeding the model to learn better label assignments. To address this issue, we propose a novel training stra…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: Accepted by INTERSPEECH 2023

  27. arXiv:2310.11160  [pdf, other]

    cs.SD eess.AS

    Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion

    Authors: Xueyao Zhang, Yicheng Gu, Haopeng Chen, Zihao Fang, Lexiao Zou, Junan Zhang, Liumeng Xue, **chao Zhang, Jie Zhou, Zhizheng Wu

    Abstract: Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution involves utilizing a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features can meet the SVC req…

    Submitted 27 May, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

  28. arXiv:2310.06248  [pdf, other]

    eess.AS

    Discriminative Speech Recognition Rescoring with Pre-trained Language Models

    Authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Second pass rescoring is a critical component of competitive automatic speech recognition (ASR) systems. Large language models have demonstrated their ability in using pre-trained information for better rescoring of ASR hypothesis. Discriminative training, directly optimizing the minimum word-error-rate (MWER) criterion typically improves rescoring. In this study, we propose and explore several di…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: ASRU 2023

  29. arXiv:2310.03747  [pdf, other]

    eess.SP cs.AI cs.LG

    A Knowledge-Driven Cross-view Contrastive Learning for EEG Representation

    Authors: Weining Weng, Yang Gu, Qihui Zhang, Yingying Huang, Chunyan Miao, Yiqiang Chen

    Abstract: Due to the abundant neurophysiological information in the electroencephalogram (EEG) signal, EEG signals integrated with deep learning methods have gained substantial traction across numerous real-world tasks. However, the development of supervised learning methods based on EEG signals has been hindered by the high cost and significant label discrepancies to manually label large-scale EEG datasets…

    Submitted 21 September, 2023; originally announced October 2023.

    Comments: 14 pages, 7 figures

    MSC Class: 68T30 Knowledge representation

    ACM Class: I.2.4; I.5.2; J.3.1

  30. arXiv:2310.01453  [pdf, other]

    eess.SP cs.IT

    Enhancing Secrecy Capacity in PLS Communication with NORAN based on Pilot Information Codebooks

    Authors: Yebo Gu, Tao Shen, Jian Song, Qingbo Wang

    Abstract: In recent research, non-orthogonal artificial noise (NORAN) has been proposed as an alternative to orthogonal artificial noise (AN). However, NORAN introduces additional noise into the channel, which reduces the capacity of the legitimate channel (LC). At the same time, selecting a NORAN design with ideal security performance from a large number of design options is also a challenging problem. To…

    Submitted 2 October, 2023; originally announced October 2023.

  31. arXiv:2309.15649  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

    Authors: Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, Andreas Stolcke

    Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these task without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines caus…

    Submitted 10 October, 2023; v1 submitted 27 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  32. arXiv:2309.15223  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SD eess.AS

    Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

    Authors: Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G. Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, Yi-Chieh Liu, Tuan Dinh, Ankur Gandhe, Denis Filimonov, Shalini Ghosh, Andreas Stolcke, Ariya Rastow, Ivan Bulyko

    Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we p…

    Submitted 10 October, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages

    Journal ref: Proc. IEEE ASRU Workshop, Dec. 2023

  33. arXiv:2309.12792  [pdf, other]

    eess.AS cs.SD

    DurIAN-E: Duration Informed Attention Network For Expressive Text-to-Speech Synthesis

    Authors: Yu Gu, Yianrao Bian, Guangzhi Lei, Chao Weng, Dan Su

    Abstract: This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model is adopted. Meanwhile the proposed Du…

    Submitted 22 September, 2023; originally announced September 2023.

  34. arXiv:2309.03906  [pdf, other]

    eess.IV cs.CV

    A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

    Authors: Ziyan Huang, Zhongying Deng, ** Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao

    Abstract: Although deep learning have revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: Can models trained on these datasets generalize well on different ones? If yes/no, how to further improve their generalizability? To address t…

    Submitted 7 September, 2023; originally announced September 2023.

  35. arXiv:2308.14553  [pdf, other]

    eess.AS cs.SD

    Rep2wav: Noise Robust text-to-speech Using self-supervised representations

    Authors: Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

    Abstract: Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background…

    Submitted 3 September, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 5 pages, 2 figures

  36. arXiv:2308.00247  [pdf, other]

    eess.IV cs.CV

    Unleashing the Power of Self-Supervised Image Denoising: A Comprehensive Review

    Authors: Dan Zhang, Fangfang Zhou, Felix Albu, Yuanzhou Wei, Xiao Yang, Yuan Gu, Qiang Li

    Abstract: The advent of deep learning has brought a revolutionary transformation to image denoising techniques. However, the persistent challenge of acquiring noise-clean pairs for supervised methods in real-world scenarios remains formidable, necessitating the exploration of more practical self-supervised image denoising. This paper focuses on self-supervised image denoising methods that offer effective so…

    Submitted 25 March, 2024; v1 submitted 31 July, 2023; originally announced August 2023.

    Comments: 24 pages

  37. arXiv:2308.00218  [pdf, other]

    eess.SY cs.AI cs.LG

    Deep Reinforcement Learning-Based Battery Conditioning Hierarchical V2G Coordination for Multi-Stakeholder Benefits

    Authors: Yubao Zhang, Xin Chen, Yi Gu, Zhicheng Li, Wu Kai

    Abstract: With the growing prevalence of electric vehicles (EVs) and advancements in EV electronics, vehicle-to-grid (V2G) techniques and large-scale scheduling strategies have emerged to promote renewable energy utilization and power grid stability. This study proposes a multi-stakeholder hierarchical V2G coordination based on deep reinforcement learning (DRL) and the Proof of Stake algorithm. Furthermore,…

    Submitted 31 July, 2023; originally announced August 2023.

  38. arXiv:2307.11513  [pdf, other]

    eess.IV cs.CV

    Bone mineral density estimation from a plain X-ray image by learning decomposition into projections of bone-segmented computed tomography

    Authors: Yi Gu, Yoshito Otake, Keisuke Uemura, Mazen Soufi, Masaki Takao, Hugues Talbot, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato

    Abstract: Osteoporosis is a prevalent bone disease that causes fractures in fragile bones, leading to a decline in daily living activities. Dual-energy X-ray absorptiometry (DXA) and quantitative computed tomography (QCT) are highly accurate for diagnosing osteoporosis; however, these modalities require special equipment and scan protocols. To frequently monitor bone health, low-cost, low-dose, and ubiquito…

    Submitted 21 July, 2023; originally announced July 2023.

    Comments: 20 pages and 22 figures

  39. arXiv:2307.06832  [pdf, other]

    eess.AS cs.CL

    Personalization for BERT-based Discriminative Speech Recognition Rescoring

    Authors: Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Recognition of personalized content remains a challenge in end-to-end speech recognition. We explore three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model. We use internal de-identified en-US data from interactions with a virtual voice assistant supplemented with personalized…

    Submitted 13 July, 2023; originally announced July 2023.

  40. arXiv:2307.02720  [pdf, other]

    cs.CL cs.SD eess.AS

    On-Device Constrained Self-Supervised Speech Representation Learning for Keyword Spotting via Knowledge Distillation

    Authors: Gene-Ping Yang, Yue Gu, Qingming Tang, Dongsu Du, Yuzong Liu

    Abstract: Large self-supervised models are effective feature extractors, but their application is challenging under on-device budget constraints and biased dataset collection, especially in keyword spotting. To address this, we proposed a knowledge distillation-based self-supervised speech representation learning (S3RL) architecture for on-device keyword spotting. Our approach used a teacher-student framewo…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Accepted to Interspeech 2023

  41. arXiv:2306.15815  [pdf, other]

    eess.AS

    Scaling Laws for Discriminative Speech Recognition Rescoring Models

    Authors: Yile Gu, Prashanth Gurunath Shivakumar, Jari Kolehmainen, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus…

    Submitted 27 June, 2023; originally announced June 2023.

  42. arXiv:2306.09452  [pdf, other]

    eess.AS

    Distillation Strategies for Discriminative Speech Recognition Rescoring

    Authors: Prashanth Gurunath Shivakumar, Jari Kolehmainen, Yile Gu, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko

    Abstract: Second-pass rescoring is employed in most state-of-the-art speech recognition systems. Recently, BERT based models have gained popularity for re-ranking the n-best hypothesis by exploiting the knowledge from masked language model pre-training. Further, fine-tuning with discriminative loss such as minimum word error rate (MWER) has shown to perform better than likelihood-based loss. Streaming appli…

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023

  43. arXiv:2306.09116  [pdf, other]

    eess.IV cs.CV

    Accurate Airway Tree Segmentation in CT Scans via Anatomy-aware Multi-class Segmentation and Topology-guided Iterative Learning

    Authors: Puyang Wang, Dazhou Guo, Dandan Zheng, Minghui Zhang, Haogang Yu, Xin Sun, Jia Ge, Yun Gu, Le Lu, Xianghua Ye, Dakai Jin

    Abstract: Intrathoracic airway segmentation in computed tomography (CT) is a prerequisite for various respiratory disease analyses such as chronic obstructive pulmonary disease (COPD), asthma and lung cancer. Unlike other organs with simpler shapes or topology, the airway's complex tree structure imposes an unbearable burden to generate the "ground truth" label (up to 7 or 3 hours of manual or semi-automati…

    Submitted 15 June, 2023; originally announced June 2023.

  44. arXiv:2305.13957  [pdf, other]

    eess.AS

    Eeg2vec: Self-Supervised Electroencephalographic Representation Learning

    Authors: Qiushi Zhu, Xiaoying Zhao, Jie Zhang, Yu Gu, Chao Weng, Yuchen Hu

    Abstract: Recently, many efforts have been made to explore how the brain processes speech using electroencephalographic (EEG) signals, where deep learning-based approaches were shown to be applicable in this field. In order to decode speech signals from EEG signals, linear networks, convolutional neural networks (CNN) and long short-term memory networks are often used in a supervised manner. Recording EEG-s…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages

  45. arXiv:2303.11701  [pdf, other]

    eess.IV cs.CV cs.LG

    A High-Frequency Focused Network for Lightweight Single Image Super-Resolution

    Authors: Xiaotian Weng, Yi Chen, Zhichao Zheng, Yanhui Gu, Junsheng Zhou, Yudong Zhang

    Abstract: Lightweight neural networks for single-image super-resolution (SISR) tasks have made substantial breakthroughs in recent years. Compared to low-frequency information, high-frequency detail is much more difficult to reconstruct. Most SISR models allocate equal computational resources for low-frequency and high-frequency information, which leads to redundant processing of simple low-frequency inform…

    Submitted 21 March, 2023; originally announced March 2023.

  46. arXiv:2303.05745  [pdf, other]

    eess.IV cs.CV

    Multi-site, Multi-domain Airway Tree Modeling (ATM'22): A Public Benchmark for Pulmonary Airway Segmentation

    Authors: Minghui Zhang, Yangqian Wu, Hanxiao Zhang, Yulei Qin, Hao Zheng, Wen Tang, Corey Arnold, Chenhao Pei, Pengxin Yu, Yang Nan, Guang Yang, Simon Walsh, Dominic C. Marshall, Matthieu Komorowski, Puyang Wang, Dazhou Guo, Dakai Jin, Ya'nan Wu, Shuiqing Zhao, Runsheng Chang, Boyu Zhang, Xing Lv, Abdul Qayyum, Moona Mazher, Qi Su, et al. (11 additional authors not shown)

    Abstract: Open international challenges are becoming the de facto standard for assessing computer vision and image analysis algorithms. In recent years, new methods have extended the reach of pulmonary airway segmentation that is closer to the limit of image resolution. Since EXACT'09 pulmonary airway segmentation, limited effort has been directed to quantitative comparison of newly emerged algorithms drive…

    Submitted 27 June, 2023; v1 submitted 10 March, 2023; originally announced March 2023.

    Comments: 32 pages, 16 figures. Homepage: https://atm22.grand-challenge.org/. Submitted

  47. arXiv:2303.04595  [pdf]

    eess.IV cs.CV

    Structure-aware registration network for liver DCE-CT images

    Authors: Peng Xue, Jingyang Zhang, Lei Ma, Mianxin Liu, Yuning Gu, Jiawei Huang, Feihong Liua, Yongsheng Pan, Xiaohuan Cao, Dinggang Shen

    Abstract: Image registration of liver dynamic contrast-enhanced computed tomography (DCE-CT) is crucial for diagnosis and image-guided surgical planning of liver cancer. However, intensity variations due to the flow of contrast agents combined with complex spatial motion induced by respiration brings great challenge to existing intensity-based registration methods. To address these problems, we propose a no…

    Submitted 8 March, 2023; originally announced March 2023.

  48. arXiv:2303.04255  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-supervised speech representation learning for keyword-spotting with light-weight transformers

    Authors: Chenyang Gao, Yue Gu, Francesco Caliva, Yuzong Liu

    Abstract: Self-supervised speech representation learning (S3RL) is revolutionizing the way we leverage the ever-growing availability of data. While S3RL related studies typically use large models, we employ light-weight networks to comply with tight memory of compute-constrained devices. We demonstrate the effectiveness of S3RL on a keyword-spotting (KS) problem by using transformers with 330k parameters an…

    Submitted 7 March, 2023; originally announced March 2023.

  49. arXiv:2303.02018  [pdf, other]

    eess.SP

    Efficient Aberration Correction via Optimal Bulk Speed of Sound Compensation

    Authors: Scott Schoen Jr, Viksit Kumar, Yuyang Gu, Sunethra Dayavansha, Rimon Tadross, Mike Washburn, Kai Thomenius, Anthony E. Samir

    Abstract: Diagnostic ultrasound is a versatile and practical tool in the abdomen, and is particularly vital toward the detection and mitigation of early-stage non-alcoholic fatty liver disease (NAFLD). However, its performance in those with obesity -- who are at increased risk for NAFLD -- is degraded due to distortions of the ultrasound as it traverses thicker, acoustically heterogeneous body walls (aberra…

    Submitted 13 May, 2024; v1 submitted 3 March, 2023; originally announced March 2023.

  50. arXiv:2303.01701  [pdf, other]

    eess.SY

    Descriptor State Space Modeling of Power Systems

    Authors: Yitong Li, Timothy C. Green, Yunjie Gu

    Abstract: State space is widely used for modeling power systems and analyzing their dynamics but it is limited to representing causal and proper systems in which the number of zeros does not exceed the number of poles. In other words, the system input, output, and state can not be freely selected. This limits how flexibly models are constructed, and in some circumstances, can introduce errors because of the…

    Submitted 2 March, 2023; originally announced March 2023.