Skip to main content

Showing 1–50 of 266 results for author: Wu, Z

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.15222  [pdf

    eess.IV cs.AI cs.CV

    Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study

    Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, **gyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan , et al. (15 additional authors not shown)

    Abstract: Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed… ▽ More

    Submitted 24 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: under peer review

  2. arXiv:2406.13340  [pdf, other

    cs.CL cs.SD eess.AS

    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

    Authors: Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

    Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, includin… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  3. arXiv:2406.11336  [pdf, other

    eess.SY

    LFPLM: A General and Flexible Load Forecasting Framework based on Pre-trained Language Model

    Authors: Mingyang Gao, Suyang Zhou, Wei Gu, Zhi Wu, Zijian Hu, Hong Zhu, Haiquan Liu

    Abstract: Accurate load forecasting is essential for maintaining the power balance between generators and consumers, especially with the increasing integration of renewable energy sources, which introduce significant intermittent volatility. With the development of data-driven methods, machine learning and deep learning-based models have become the predominant approach for load forecasting tasks. In recent… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures and 5 tables

  4. arXiv:2406.10844  [pdf, other

    eess.AS cs.SD

    Multi-Scale Accent Modeling with Disentangling for Multi-Speaker Multi-Accent TTS Synthesis

    Authors: Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, Haizhou Li

    Abstract: Synthesizing speech across different accents while preserving the speaker identity is essential for various real-world customer applications. However, the individual and accurate modeling of accents and speakers in a text-to-speech (TTS) system is challenging due to the complexity of accent variations and the intrinsic entanglement between the accent and speaker identity. In this paper, we present… ▽ More

    Submitted 16 June, 2024; originally announced June 2024.

  5. arXiv:2406.08336  [pdf, other

    cs.SD cs.CV eess.AS

    CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Dongchao Yang, Dingdong Wang, Xixin Wu, Zhiyong Wu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (… ▽ More

    Submitted 24 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  6. arXiv:2406.07989  [pdf, other

    cs.IT eess.SP

    Near-Field Wideband Beam Training Based on Distance-Dependent Beam Split

    Authors: Tianyue Zheng, Mingyao Cui, Zidong Wu, Linglong Dai

    Abstract: Near-field beam training is essential for acquiring channel state information in 6G extremely large-scale multiple input multiple output (XL-MIMO) systems. To achieve low-overhead beam training, existing method has been proposed to leverage the near-field beam split effect, which deploys true-time-delay arrays to simultaneously search multiple angles of the entire angular range in a distance ring… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.04791  [pdf, other

    cs.SD eess.AS

    Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR

    Authors: Shaojun Li, Daimeng Wei, Jiaxin Guo, ZongYao Li, Zhanglin Wu, Zhiqiang Rao, Yuanchang Luo, Xianghui He, Hao Yang

    Abstract: Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, the performance can degrade due to vocal characteristic mismatches between training and testing data, particularly with limited target speaker adaptation data. We propose a novel speaker adaptation approach Speaker-Smoothed kNN that leverages k-Nearest Neighbors (kNN) retrieval techniques to improve model out… ▽ More

    Submitted 11 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  8. arXiv:2406.03438  [pdf, other

    cs.IT eess.SP

    CSI-GPT: Integrating Generative Pre-Trained Transformer with Federated-Tuning to Acquire Downlink Massive MIMO Channels

    Authors: Ye Zeng, Li Qiao, Zhen Gao, Tong Qin, Zhonghuai Wu, Sheng Chen, Mohsen Guizani

    Abstract: In massive multiple-input multiple-output (MIMO) systems, how to reliably acquire downlink channel state information (CSI) with low overhead is challenging. In this work, by integrating the generative pre-trained Transformer (GPT) with federated-tuning, we propose a CSI-GPT approach to realize efficient downlink CSI acquisition. Specifically, we first propose a Swin Transformer-based channel acqui… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  9. arXiv:2406.02921  [pdf, other

    cs.CL cs.AI cs.LG cs.NE eess.AS

    Text Injection for Neural Contextual Biasing

    Authors: Zhong Meng, Zelin Wu, Rohit Prabhavalkar, Cal Peyser, Weiran Wang, Nanxin Chen, Tara N. Sainath, Bhuvana Ramabhadran

    Abstract: Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker's context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and it… ▽ More

    Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure

    Journal ref: Interspeech 2024, Kos Island, Greece

  10. arXiv:2406.02518  [pdf, other

    cs.CV eess.IV

    DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume Rendering

    Authors: Zhongpai Gao, Benjamin Planche, Meng Zheng, Xiao Chen, Terrence Chen, Ziyan Wu

    Abstract: Digitally reconstructed radiographs (DRRs) are simulated 2D X-ray images generated from 3D CT volumes, widely used in preoperative settings but limited in intraoperative applications due to computational bottlenecks, especially for accurate but heavy physics-based Monte Carlo methods. While analytical DRR renderers offer greater efficiency, they overlook anisotropic X-ray image formation phenomena… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

  11. arXiv:2405.18782  [pdf, other

    eess.IV cs.CV stat.ML

    Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors

    Authors: Zihui Wu, Yu Sun, Yifan Chen, Bingliang Zhang, Yisong Yue, Katherine L. Bouman

    Abstract: Diffusion models (DMs) have recently shown outstanding capability in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior de… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

  12. arXiv:2405.13771  [pdf, other

    eess.IV cs.CV cs.LG

    Multi-Dataset Multi-Task Learning for COVID-19 Prognosis

    Authors: Filippo Ruffini, Lorenzo Tronchin, Zhuoru Wu, Wenting Chen, Paolo Soda, Linlin Shen, Valerio Guarrasi

    Abstract: In the fight against the COVID-19 pandemic, leveraging artificial intelligence to predict disease outcomes from chest radiographic images represents a significant scientific aim. The challenge, however, lies in the scarcity of large, labeled datasets with compatible tasks for training deep learning models without leading to overfitting. Addressing this issue, we introduce a novel multi-dataset mul… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

  13. arXiv:2405.11856  [pdf, other

    cs.RO eess.SY

    Modeling and simulation of a mechanism for suppressing the flip** problem of a jum** robot

    Authors: Qi Li, Liang Peng, Zhiyuan Wu, Pengda Ye, Weitao Zhang, Yi Xu, Qing Shi

    Abstract: In order to solve the problem of stable jum** of micro robot, we design a special mechanism: elastic passive joint (EPJ). EPJ can assist in achieving smooth jum** through the opening-closing process when the robot jumps. First, we introduce the composition and operation principle of EPJ, and perform a dynamic modeling of the robot's jum** process. Then, in order to verify the effectiveness o… ▽ More

    Submitted 20 May, 2024; originally announced May 2024.

  14. arXiv:2405.04867  [pdf, other

    eess.IV cs.CV

    MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

    Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhi**g Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Hai** Zeng, Kai Feng , et al. (24 additional authors not shown)

    Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra… ▽ More

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

  15. arXiv:2404.17161  [pdf, other

    cs.SD eess.AS eess.SP

    An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

    Authors: Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

    Abstract: Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in Short-Time Fourier Transform (STFT), which owns a constan… ▽ More

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: text overlap with arXiv:2311.14957

  16. arXiv:2404.16619  [pdf, other

    cs.SD eess.AS

    The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge

    Authors: Yixuan Zhou, Shuoyi Zhou, Shun Lei, Zhiyong Wu, Menglin Wu

    Abstract: This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by THU-HCSI team for LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. For further improving speaker similarity and speech quality, we introduce speaker-aware text encoder… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted in Grand Challenge of ICASSP 2024

  17. arXiv:2404.16484  [pdf, other

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, **shan Pan, Jiangxin Dong, **hui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi **, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu , et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  18. arXiv:2404.11938  [pdf, other

    cs.MM cs.DC cs.SD eess.AS

    HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis

    Authors: Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu

    Abstract: Multimodal Sentiment Analysis (MSA) aims to identify speakers' sentiment tendencies in multimodal video content, raising serious concerns about privacy risks associated with multimodal data, such as voiceprints and facial images. Recent distributed collaborative learning has been verified as an effective paradigm for privacy preservation in multimodal tasks. However, they often overlook the privac… ▽ More

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: 13 pages, IJCAI-2024

  19. arXiv:2404.10343  [pdf, other

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang , et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such… ▽ More

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  20. arXiv:2404.10180  [pdf, other

    cs.CL cs.AI cs.LG cs.NE eess.AS

    Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

    Authors: Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar

    Abstract: Contextual biasing enables speech recognizers to transcribe important phrases in the speaker's context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach which allows for full end-to-end cotraining of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of… ▽ More

    Submitted 23 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: 9 pages, 3 figures, accepted by NAACL 2024 - Industry Track

    Journal ref: 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics - Industry Track

  21. arXiv:2403.20289  [pdf, other

    cs.CL cs.SD eess.AS

    Emotion-Anchored Contrastive Learning Framework for Emotion Recognition in Conversation

    Authors: Fangxu Yu, Junjie Guo, Zhen Wu, Xinyu Dai

    Abstract: Emotion Recognition in Conversation (ERC) involves detecting the underlying emotion behind each utterance within a conversation. Effectively generating representations for utterances remains a significant challenge in this task. Recent works propose various models to address this issue, but they still struggle with differentiating similar emotions such as excitement and happiness. To alleviate thi… ▽ More

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: Accepted by Findings of NAACL 2024

  22. DBPF: A Framework for Efficient and Robust Dynamic Bin-Picking

    Authors: Yichuan Li, Junkai Zhao, Yixiao Li, Zheng Wu, Rui Cao, Masayoshi Tomizuka, Yunhui Liu

    Abstract: Efficiency and reliability are critical in robotic bin-picking as they directly impact the productivity of automated industrial processes. However, traditional approaches, demanding static objects and fixed collisions, lead to deployment limitations, operational inefficiencies, and process unreliability. This paper introduces a Dynamic Bin-Picking Framework (DBPF) that challenges traditional stati… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: 8 pages, 5 figures. This paper has been accepted by IEEE RA-L on 2024-03-24. See the supplementary video at youtube: https://youtu.be/n5af2VsKhkg

  23. arXiv:2403.11845  [pdf

    eess.SP

    Simplified Self-homodyne Coherent System Based on Alamouti Coding and Digital Subcarrier Multiplexing

    Authors: Wei Wang, Dongdong Zou, Zhenpeng Wu, Qi Sui, Xingwen Yi, Fan Li, Chao Lu, Zhaohui Li

    Abstract: Coherent technology inherent with more availabledegrees of freedom is deemed a competitive solution for nextgeneration ultra-high-speed short-reach optical interconnects.However, the fatal barriers to implementing the conventiona.coherent system in short-reach optical interconnect are the costfootprint, and power consumption. Self-homodyne coherentsystem exhibits its potential to reduce the power… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

  24. arXiv:2403.06185  [pdf, other

    cs.IT eess.SP math.OC

    Quantized Constant-Envelope Waveform Design for Massive MIMO DFRC Systems

    Authors: Zheyu Wu, Ya-Feng Liu, Wei-Kun Chen, Christos Masouros

    Abstract: Both dual-functional radar-communication (DFRC) and massive multiple-input multiple-output (MIMO) have been recognized as enabling technologies for 6G wireless networks. This paper considers the advanced waveform design for hardware-efficient massive MIMO DFRC systems. Specifically, the transmit waveform is imposed with the quantized constant-envelope (QCE) constraint, which facilitates the employ… ▽ More

    Submitted 10 March, 2024; originally announced March 2024.

    Comments: 17 pages, 11 figures, submitted for possible publication

  25. arXiv:2403.05834  [pdf, other

    cs.MM cs.SD eess.AS

    Enhancing Expressiveness in Dance Generation via Integrating Frequency and Music Style Information

    Authors: Qiaochu Huang, Xu He, Boshi Tang, Haolin Zhuang, Liyang Chen, Shuochen Gao, Zhiyong Wu, Haozhi Huang, Helen Meng

    Abstract: Dance generation, as a branch of human motion generation, has attracted increasing attention. Recently, a few works attempt to enhance dance expressiveness, which includes genre matching, beat alignment, and dance dynamics, from certain aspects. However, the enhancement is quite limited as they lack comprehensive consideration of the aforementioned three factors. In this paper, we propose Expressi… ▽ More

    Submitted 9 March, 2024; originally announced March 2024.

  26. arXiv:2403.05010  [pdf, other

    cs.SD cs.AI eess.AS

    RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

    Authors: Peng Liu, Dongyang Dai, Zhiyong Wu

    Abstract: Recent advancements in generative modeling have significantly enhanced the reconstruction of audio waveforms from various representations. While diffusion models are adept at this task, they are hindered by latency issues due to their operation at the individual sample point level and the need for numerous sampling steps. In this study, we introduce RFWave, a cutting-edge multi-band Rectified Flow… ▽ More

    Submitted 2 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  27. arXiv:2403.04374  [pdf, other

    eess.SY cs.AI

    Model-Free Load Frequency Control of Nonlinear Power Systems Based on Deep Reinforcement Learning

    Authors: Xiaodi Chen, Meng Zhang, Zhengguang Wu, Ligang Wu, Xiaohong Guan

    Abstract: Load frequency control (LFC) is widely employed in power systems to stabilize frequency fluctuation and guarantee power quality. However, most existing LFC methods rely on accurate power system modeling and usually ignore the nonlinear characteristics of the system, limiting controllers' performance. To solve these problems, this paper proposes a model-free LFC method for nonlinear power systems b… ▽ More

    Submitted 7 March, 2024; originally announced March 2024.

  28. arXiv:2403.03552  [pdf, other

    cs.GT cs.LG cs.MA eess.SY

    Population-aware Online Mirror Descent for Mean-Field Games by Deep Reinforcement Learning

    Authors: Zida Wu, Mathieu Lauriere, Samuel Jia Cong Chua, Matthieu Geist, Olivier Pietquin, Ankur Mehta

    Abstract: Mean Field Games (MFGs) have the ability to handle large-scale multi-agent systems, but learning Nash equilibria in MFGs remains a challenging task. In this paper, we propose a deep reinforcement learning (DRL) algorithm that achieves population-dependent Nash equilibrium without the need for averaging or sampling from history, inspired by Munchausen RL and Online Mirror Descent. Through the desig… ▽ More

    Submitted 6 March, 2024; originally announced March 2024.

  29. arXiv:2403.03100  [pdf, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

    Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, **yu Li, Sheng Zhao

    Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing di… ▽ More

    Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

    Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

  30. arXiv:2403.01435  [pdf, ps, other

    math.OC eess.SY

    Distributed Least-Squares Optimization Solvers with Differential Privacy

    Authors: Weijia Liu, Lei Wang, Fanghong Guo, Zhengguang Wu, Hongye Su

    Abstract: This paper studies the distributed least-squares optimization problem with differential privacy requirement of local cost functions, for which two differentially private distributed solvers are proposed. The first is established on the distributed gradient tracking algorithm, by appropriately perturbing the initial values and parameters that contain the privacy-sensitive data with Gaussian and tru… ▽ More

    Submitted 3 March, 2024; originally announced March 2024.

  31. arXiv:2402.18175  [pdf, other

    cs.CV eess.IV

    Self-Supervised Spatially Variant PSF Estimation for Aberration-Aware Depth-from-Defocus

    Authors: Zhuofeng Wu, Yusuke Monno, Masatoshi Okutomi

    Abstract: In this paper, we address the task of aberration-aware depth-from-defocus (DfD), which takes account of spatially variant point spread functions (PSFs) of a real camera. To effectively obtain the spatially variant PSFs of a real camera without requiring any ground-truth PSFs, we propose a novel self-supervised learning method that leverages the pair of real sharp and blurred images, which can be e… ▽ More

    Submitted 28 February, 2024; originally announced February 2024.

    Journal ref: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

  32. arXiv:2402.12660  [pdf, other

    cs.SD cs.HC eess.AS

    SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion

    Authors: Liumeng Xue, Chaoren Wang, Mingxuan Wang, Xueyao Zhang, Jun Han, Zhizheng Wu

    Abstract: In this study, we present SingVisio, an interactive visual analysis system that aims to explain the diffusion model used in singing voice conversion. SingVisio provides a visual display of the generation process in diffusion models, showcasing the step-by-step denoising of the noisy spectrum and its transformation into a clean spectrum that captures the desired singer's timbre. The system also fac… ▽ More

    Submitted 19 February, 2024; originally announced February 2024.

  33. arXiv:2402.03412  [pdf, other

    eess.IV cs.CV

    See More Details: Efficient Image Super-Resolution by Experts Mining

    Authors: Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yulun Zhang, Radu Timofte

    Abstract: Reconstructing high-resolution (HR) images from low-resolution (LR) inputs poses a significant challenge in image super-resolution (SR). While recent approaches have demonstrated the efficacy of intricate operations customized for various objectives, the straightforward stacking of these disparate operations can result in a substantial computational burden, hampering their practical utility. In re… ▽ More

    Submitted 6 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024

  34. arXiv:2401.17796  [pdf, other

    cs.SD eess.AS

    Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

    Authors: Xueyuan Chen, Yuejiao Wang, Xixin Wu, Disong Wang, Zhiyong Wu, Xunying Liu, Helen Meng

    Abstract: Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech by improving the intelligibility and naturalness. This is a challenging task especially for patients with severe dysarthria and speaking in complex, noisy acoustic environments. To address these challenges, we propose a novel multi-modal framework to utilize visual information, e.g., lip movements, in DSR… ▽ More

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  35. arXiv:2401.13276  [pdf, other

    eess.AS

    SCNet: Sparse Compression Network for Music Source Separation

    Authors: Weinan Tong, Jiaxu Zhu, Jun Chen, Shiyin Kang, Tao Jiang, Yang Li, Zhiyong Wu, Helen Meng

    Abstract: Deep learning-based methods have made significant achievements in music source separation. However, obtaining good results while maintaining a low model complexity remains challenging in super wide-band music source separation. Previous works either overlook the differences in subbands or inadequately address the problem of information loss when generating subband features. In this paper, we propo… ▽ More

    Submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  36. arXiv:2401.12264  [pdf, other

    eess.AS cs.MM cs.SD eess.IV

    CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

    Authors: Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li

    Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tends to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems… ▽ More

    Submitted 21 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  37. arXiv:2401.12025  [pdf, other

    cs.IT eess.SP math.OC

    A Survey of Recent Advances in Optimization Methods for Wireless Communications

    Authors: Ya-Feng Liu, Tsung-Hui Chang, Mingyi Hong, Zheyu Wu, Anthony Man-Cho So, Eduard A. Jorswieck, Wei Yu

    Abstract: Mathematical optimization is now widely regarded as an indispensable modeling and solution tool for the design of wireless communications systems. While optimization has played a significant role in the revolutionary progress in wireless communication and networking technologies from 1G to 5G and onto the future 6G, the innovations in wireless technologies have also substantially transformed the n… ▽ More

    Submitted 7 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: 39 pages, 5 figures, accepted for publication in IEEE Journal on Selected Areas in Communications

  38. arXiv:2401.08430  [pdf, other

    eess.SY eess.SP

    A Dynamic Capacitance Matching (DCM)-based Current Response Algorithm for Signal Line RC Network

    Authors: Zhoujie Wu, Cai Luo, Zhong Guan

    Abstract: This paper proposes a dynamic capacitance matching (DCM)-based RC current response algorithm for calculating the current waveform of a signal line without performing SPICE simulation. Specifically, unlike previous method such as CCS model, driver linear representation, waveform functional fitting or equivalent load capacitance, our algorithm does not rely on fixed reduced model of both standard ce… ▽ More

    Submitted 16 January, 2024; originally announced January 2024.

  39. arXiv:2401.07532  [pdf, other

    cs.SD cs.AI eess.AS

    Multi-view MidiVAE: Fusing Track- and Bar-view Representations for Long Multi-track Symbolic Music Generation

    Authors: Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, **g Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, Helen Meng

    Abstract: Variational Autoencoders (VAEs) constitute a crucial component of neural symbolic music generation, among which some works have yielded outstanding results and attracted considerable attention. Nevertheless, previous VAEs still encounter issues with overly long feature sequences and generated results lack contextual coherence, thus the challenge of modeling long multi-track symbolic music still re… ▽ More

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  40. arXiv:2401.07494  [pdf, other

    cs.LG cs.CE eess.SY

    Input Convex Lipschitz RNN: A Fast and Robust Approach for Engineering Tasks

    Authors: Zihao Wang, Zhe Wu

    Abstract: Computational efficiency and non-adversarial robustness are critical factors in process modeling and optimization for real-world engineering applications. Yet, conventional neural networks often fall short in addressing both simultaneously, or even separately. Drawing insights from natural physical systems and existing literature, it is known theoretically that an input convex architecture will en… ▽ More

    Submitted 17 May, 2024; v1 submitted 15 January, 2024; originally announced January 2024.

  41. arXiv:2401.05431  [pdf, other

    eess.SP cs.AI cs.LG

    TRLS: A Time Series Representation Learning Framework via Spectrogram for Medical Signal Processing

    Authors: Luyuan Xie, Cong Li, Xin Zhang, Shengfang Zhai, Yuejian Fang, Qingni Shen, Zhonghai Wu

    Abstract: Representation learning frameworks in unlabeled time series have been proposed for medical signal processing. Despite the numerous excellent progresses have been made in previous works, we observe the representation extracted for the time series still does not generalize well. In this paper, we present a Time series (medical signal) Representation Learning framework via Spectrogram (TRLS) to get m… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: This paper is accept by ICASSP 2024. This is a more detailed version

  42. arXiv:2401.04235  [pdf, other

    cs.CL cs.SD eess.AS

    High-precision Voice Search Query Correction via Retrievable Speech-text Embedings

    Authors: Christopher Li, Gary Wang, Kyle Kastner, Heng Su, Allen Chen, Andrew Rosenberg, Zhehuai Chen, Zelin Wu, Leonid Velikovich, Pat Rondon, Diamantino Caseiro, Petar Aleksic

    Abstract: Automatic speech recognition (ASR) systems can suffer from poor recall for various reasons, such as noisy audio, lack of sufficient training data, etc. Previous work has shown that recall can be improved by retrieving rewrite candidates from a large database of likely, contextually-relevant alternatives to the hypothesis text using nearest-neighbors search over embeddings of the ASR hypothesis t… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.

  43. arXiv:2401.03476  [pdf, other

    cs.MM cs.AI cs.HC cs.SD eess.AS

    Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

    Authors: Sicheng Yang, Zunnan Xu, Haiwei Xue, Yongkang Cheng, Shaoli Huang, Mingming Gong, Zhiyong Wu

    Abstract: Current talking avatars mostly generate co-speech gestures based on audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tac… ▽ More

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 6 pages, 3 figures, ICASSP 2024

  44. arXiv:2401.00283  [pdf, other

    cs.IT eess.SP

    Near-Space Communications: the Last Piece of 6G Space-Air-Ground-Sea Integrated Network Puzzle

    Authors: Hongshan Liu, Tong Qin, Zhen Gao, Tianqi Mao, Keke Ying, Ziwei Wan, Li Qiao, Rui Na, Zhongxiang Li, Chun Hu, Yikun Mei, Tuan Li, Guanghui Wen, Lei Chen, Zhonghuai Wu, Ruiqi Liu, Gaojie Chen, Shuo Wang, Dezhi Zheng

    Abstract: This article presents a comprehensive study on the emerging near-space communications (NS-COM) within the context of space-air-ground-sea integrated network (SAGSIN). Specifically, we firstly explore the recent technical developments of NS-COM, followed by the discussions about motivations behind integrating NS-COM into SAGSIN. To further demonstrate the necessity of NS-COM, a comparative analysis… ▽ More

    Submitted 4 March, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

    Comments: 28 pages, 8 figures, 2 tables

  45. arXiv:2312.15463  [pdf, other

    eess.AS cs.SD

    Consistent and Relevant: Rethink the Query Embedding in General Sound Separation

    Authors: Yuanyuan Wang, Hangting Chen, Dongchao Yang, Jianwei Yu, Chao Weng, Zhiyong Wu, Helen Meng

    Abstract: The query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embedding. In this way, separation model is optimized to be adapted to the distribution of query embedding. However, query embedding may exhibit mismatches with separation models due to in… ▽ More

    Submitted 24 December, 2023; originally announced December 2023.

    Comments: Accepted by ICASSP 2024

  46. arXiv:2312.13611  [pdf, other

    cs.LG cs.NI eess.SP

    Topology Learning for Heterogeneous Decentralized Federated Learning over Unreliable D2D Networks

    Authors: Zheshun Wu, Zenglin Xu, Dun Zeng, Junfan Li, Jie Liu

    Abstract: With the proliferation of intelligent mobile devices in wireless device-to-device (D2D) networks, decentralized federated learning (DFL) has attracted significant interest. Compared to centralized federated learning (CFL), DFL mitigates the risk of central server failures due to communication bottlenecks. However, DFL faces several challenges, such as the severe heterogeneity of data distributions… ▽ More

    Submitted 10 March, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear in IEEE Transactions on Vehicular Technology

  47. arXiv:2312.13523  [pdf

    physics.med-ph eess.IV

    High-resolution myelin-water fraction and quantitative relaxation map** using 3D ViSTa-MR fingerprinting

    Authors: Congyu Liao, Xiaozhi Cao, Siddharth Srinivasan Iyer, Sophie Schauman, Zihan Zhou, Xiaoqian Yan, Quan Chen, Zhitao Li, Nan Wang, Ting Gong, Zhe Wu, Hongjian He, Jianhui Zhong, Yang Yang, Adam Kerr, Kalanit Grill-Spector, Kawin Setsompop

    Abstract: Purpose: This study aims to develop a high-resolution whole-brain multi-parametric quantitative MRI approach for simultaneous map** of myelin-water fraction (MWF), T1, T2, and proton-density (PD), all within a clinically feasible scan time. Methods: We developed 3D ViSTa-MRF, which combined Visualization of Short Transverse relaxation time component (ViSTa) technique with MR Fingerprinting (MR… ▽ More

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: 38 pages, 12 figures and 1 table

    Journal ref: Magnetic Resonance in Medicine 2023

  48. arXiv:2312.12181  [pdf, other

    cs.SD cs.AI eess.AS

    StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

    Authors: Xueyuan Chen, Xi Wang, Shaofei Zhang, Lei He, Zhiyong Wu, Xixin Wu, Helen Meng

    Abstract: The expressive quality of synthesized speech for audiobooks is limited by generalized model architecture and unbalanced style distribution in the training data. To address these issues, in this paper, we propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis. Firstly, a text style encoder is pre-trained with a large amount of unlab… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024

  49. arXiv:2312.11896  [pdf, other

    eess.SY

    Stable Relay Learning Optimization Approach for Fast Power System Production Cost Minimization Simulation

    Authors: Zishan Guo, Qinran Hu, Tao Qian, Xin Fang, Renjie Hu, Zaijun Wu

    Abstract: Production cost minimization (PCM) simulation is commonly employed for assessing the operational efficiency, economic viability, and reliability, providing valuable insights for power system planning and operations. However, solving a PCM problem is time-consuming, consisting of numerous binary variables for simulation horizon extending over months and years. This hinders rapid assessment of moder… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Submitted to IEEE Transactions on Power Systems on December 15, 2023

  50. arXiv:2312.10381  [pdf, other

    cs.SD eess.AS

    SECap: Speech Emotion Captioning with Large Language Model

    Authors: Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shixiong Zhang, Guangzhi Li, Yi Luo, Rongzhi Gu

    Abstract: Speech emotions are crucial in human communication and are extensively used in fields like speech synthesis and natural language understanding. Most prior studies, such as speech emotion recognition, have categorized speech emotions into a fixed set of classes. Yet, emotions expressed in human speech are often complex, and categorizing them into predefined groups can be insufficient to adequately… ▽ More

    Submitted 23 December, 2023; v1 submitted 16 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024