Showing 1–50 of 307 results for author: Zha, Z

Searching in archive eess.
  1. arXiv:2407.00463  [pdf, other]

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Xuechen Liu, Sangeet Sagar, et al. (5 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presen…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Submitted to JMLR (Machine Learning Open Source Software)

  2. arXiv:2406.16929  [pdf, other]

    eess.SP cs.AI

    Modelling the 5G Energy Consumption using Real-world Data: Energy Fingerprint is All You Need

    Authors: Tingwei Chen, Yantao Wang, Hanzhi Chen, Zijian Zhao, Xinhao Li, Nicola Piovesan, Guangxu Zhu, Qingjiang Shi

    Abstract: The introduction of fifth-generation (5G) radio technology has revolutionized communications, bringing unprecedented automation, capacity, connectivity, and ultra-fast, reliable communications. However, this technological leap comes with a substantial increase in energy consumption, presenting a significant challenge. To improve the energy efficiency of 5G networks, it is imperative to develop sop…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2406.15119  [pdf, other]

    cs.SD cs.AI eess.AS

    Speech Emotion Recognition under Resource Constraints with Data Distillation

    Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

    Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment…

    Submitted 21 June, 2024; originally announced June 2024.

  4. arXiv:2406.10856  [pdf, other]

    cs.NI eess.SY

    LEO Satellite Networks Assisted Geo-distributed Data Processing

    Authors: Zhiyuan Zhao, Zhe Chen, Zheng Lin, Wenjun Zhu, Kun Qiu, Chaoqun You, Yue Gao

    Abstract: Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective soluti…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 6 pages, 5 figures

  5. arXiv:2406.07255  [pdf, other]

    cs.CV eess.IV

    Towards Realistic Data Generation for Real-World Super-Resolution

    Authors: Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Xueyang Fu, Yang Wang, Yang Cao, Zheng-Jun Zha

    Abstract: Existing image super-resolution (SR) techniques often fail to generalize effectively in complex real-world settings due to the significant divergence between training data and practical scenarios. To address this challenge, previous efforts have either manually simulated intricate physical-based degradations or utilized learning-based techniques, yet these approaches remain inadequate for producin…

    Submitted 11 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

  6. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  7. arXiv:2406.02429  [pdf, other]

    eess.AS cs.SD

    Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

    Authors: Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

    Abstract: Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training mod…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 13 pages

  8. arXiv:2406.01993  [pdf]

    eess.IV cs.CV

    Choroidal Vessel Segmentation on Indocyanine Green Angiography Images via Human-in-the-Loop Labeling

    Authors: Ruoyu Chen, Ziwei Zhao, Mayinuer Yusufu, Xianwen Shang, Danli Shi, Mingguang He

    Abstract: Human-in-the-loop (HITL) strategy has been recently introduced into the field of medical image processing. Indocyanine green angiography (ICGA) stands as a well-established examination for visualizing choroidal vasculature and detecting chorioretinal diseases. However, the intricate nature of choroidal vascular networks makes large-scale manual segmentation of ICGA images challenging. Thus, the st…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 25 pages, 4 figures

  9. arXiv:2406.01205  [pdf, other]

    eess.AS cs.LG cs.SD

    ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

    Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang, Zhou Zhao

    Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and…

    Submitted 3 June, 2024; originally announced June 2024.

  10. arXiv:2406.00356  [pdf, other]

    eess.AS cs.SD

    AudioLCM: Text-to-Audio Generation with Latent Consistency Models

    Authors: Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

    Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficie…

    Submitted 1 June, 2024; originally announced June 2024.

  11. arXiv:2406.00320  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Frieren: Efficient Video-to-Audio Generation with Rectified Flow Matching

    Authors: Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, Zhou Zhao

    Abstract: Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video, and it remains challenging to build V2A models with high generation quality, efficiency, and visual-audio temporal synchrony. We propose Frieren, a V2A model based on rectified flow matching. Frieren regresses the conditional transport vector field from noise to spectrogram latent with straight paths and c…

    Submitted 1 June, 2024; originally announced June 2024.

  12. arXiv:2406.00279  [pdf]

    eess.IV cs.CV

    Hybrid attention structure preserving network for reconstruction of under-sampled OCT images

    Authors: Zezhao Guo, Zhanfang Zhao

    Abstract: Optical coherence tomography (OCT) is a non-invasive, high-resolution imaging technology that provides cross-sectional images of tissues. Dense acquisition of A-scans along the fast axis is required to obtain high digital resolution images. However, the dense acquisition will increase the acquisition time, causing the discomfort of patients. In addition, the longer acquisition time may lead to mot…

    Submitted 31 May, 2024; originally announced June 2024.

  13. arXiv:2405.19450  [pdf, other]

    cs.CV eess.IV

    FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining

    Authors: Dong Li, Yidi Liu, Xueyang Fu, Senyan Xu, Zheng-Jun Zha

    Abstract: Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, due to it acting as an effective frequency prior for capturing rain streaks. However, although there exists a dependency between low frequency and high frequency in images, these Fourier-based methods rarely…

    Submitted 29 May, 2024; originally announced May 2024.

  14. arXiv:2405.19336  [pdf]

    eess.SP

    Image-based retrieval of all-day cloud physical parameters for FY4A/AGRI and its application over the Tibetan Plateau

    Authors: Zhijun Zhao, Feng Zhang, Wenwen Li, Jingwei Li

    Abstract: Satellite remote sensing serves as a crucial means to acquire cloud physical parameters. However, existing official cloud products derived from the advanced geostationary radiation imager (AGRI) onboard the Fengyun-4A geostationary satellite suffer from limitations in computational precision and efficiency. In this study, an image-based transfer learning model (ITLM) was developed to realize all-d…

    Submitted 28 March, 2024; originally announced May 2024.

  15. arXiv:2405.12872  [pdf, other]

    eess.IV cs.CV

    Spatial-aware Attention Generative Adversarial Network for Semi-supervised Anomaly Detection in Medical Image

    Authors: Zerui Zhang, Zhichao Sun, Zelong Liu, Bo Du, Rui Yu, Zhou Zhao, Yongchao Xu

    Abstract: Medical anomaly detection is a critical research area aimed at recognizing abnormal images to aid in diagnosis. Most existing methods adopt synthetic anomalies and image restoration on normal samples to detect anomalies. The unlabeled data consisting of both normal and abnormal data is not well explored. We introduce a novel Spatial-aware Attention Generative Adversarial Network (SAGAN) for one-class…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Early Accept by MICCAI 2024

  16. arXiv:2405.09995  [pdf, other]

    eess.SP

    Semantic Communication via Rate Distortion Perception Bottleneck

    Authors: Zihe Zhao, Chunyue Wang

    Abstract: With the advancement of Artificial Intelligence (AI) technology, next-generation wireless communication networks are facing unprecedented challenges. Semantic communication has become a novel solution to address such challenges, enhancing the efficiency of bandwidth utilization by transmitting meaningful information and filtering out superfluous data. Unfortunately, recent studies have shown tha…

    Submitted 16 May, 2024; originally announced May 2024.

  17. arXiv:2405.09940  [pdf, other]

    eess.AS cs.SD

    Robust Singing Voice Transcription Serves Synthesis

    Authors: Ruiqi Li, Yu Zhang, Yongqi Wang, Zhiqing Hong, Rongjie Huang, Zhou Zhao

    Abstract: Note-level Automatic Singing Voice Transcription (AST) converts singing recordings into note sequences, facilitating the automatic annotation of singing datasets for Singing Voice Synthesis (SVS) applications. Current AST methods, however, struggle with accuracy and robustness when used for practical annotation. This paper presents ROSVOT, the first robust AST model that serves SVS, incorporating…

    Submitted 3 June, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

    Comments: ACL 2024

  18. arXiv:2405.07260  [pdf]

    cs.LG cs.AI eess.SP

    A Supervised Information Enhanced Multi-Granularity Contrastive Learning Framework for EEG Based Emotion Recognition

    Authors: Xiang Li, Jian Song, Zhigang Zhao, Chunxiao Wang, Dawei Song, Bin Hu

    Abstract: This study introduces a novel Supervised Info-enhanced Contrastive Learning framework for EEG based Emotion Recognition (SI-CLEER). SI-CLEER employs multi-granularity contrastive learning to create robust EEG contextual representations, potentially improving emotion recognition effectiveness. Unlike existing methods solely guided by classification loss, we propose a joint learning model combining…

    Submitted 12 May, 2024; originally announced May 2024.

    Comments: 5 pages, 3 figures, 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  19. arXiv:2405.07023  [pdf, other]

    eess.IV cs.CV

    Efficient Real-world Image Super-Resolution Via Adaptive Directional Gradient Convolution

    Authors: Long Peng, Yang Cao, Renjing Pei, Wenbo Li, Jiaming Guo, Xueyang Fu, Yang Wang, Zheng-Jun Zha

    Abstract: Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifact…

    Submitted 11 May, 2024; originally announced May 2024.

  20. arXiv:2405.05503  [pdf, other]

    eess.SP

    Communications under Bursty Mixed Gaussian-impulsive Noise: Demodulation and Performance Analysis

    Authors: Tianfu Qi, Jun Wang, Zexue Zhao

    Abstract: This is the second part of the two-part paper considering the communications under the bursty mixed noise composed of white Gaussian noise and colored non-Gaussian impulsive noise. In the first part, based on Gaussian distribution and student distribution, we proposed a multivariate bursty mixed noise model and designed model parameter estimation algorithms. However, the performance of a communica…

    Submitted 8 May, 2024; originally announced May 2024.

  21. arXiv:2405.04867  [pdf, other]

    eess.IV cs.CV

    MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

    Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhijing Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Haijin Zeng, Kai Feng, et al. (24 additional authors not shown)

    Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

  22. arXiv:2404.19750  [pdf, other]

    cs.IT eess.SP

    A Joint Communication and Computation Design for Distributed RISs Assisted Probabilistic Semantic Communication in IIoT

    Authors: Zhouxiang Zhao, Zhaohui Yang, Chongwen Huang, Li Wei, Qianqian Yang, Caijun Zhong, Wei Xu, Zhaoyang Zhang

    Abstract: In this paper, the problem of spectral-efficient communication and computation resource allocation for distributed reconfigurable intelligent surfaces (RISs) assisted probabilistic semantic communication (PSC) in industrial Internet-of-Things (IIoT) is investigated. In the considered model, multiple RISs are deployed to serve multiple users, while PSC adopts compute-then-transmit protocol to reduc…

    Submitted 30 April, 2024; originally announced April 2024.

  23. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  24. arXiv:2404.15992  [pdf, other]

    cs.CV eess.IV

    HDDGAN: A Heterogeneous Dual-Discriminator Generative Adversarial Network for Infrared and Visible Image Fusion

    Authors: Guosheng Lu, Zile Fang, Chunming He, Zhigang Zhao

    Abstract: Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. Consequently, IVIF offers distinct advantages in practical applications such as video surveillance, night navigation…

    Submitted 24 April, 2024; originally announced April 2024.

  25. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  26. arXiv:2404.09313  [pdf, other]

    eess.AS cs.AI

    Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

    Authors: Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

    Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention has been paid to exploring song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocals and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consi…

    Submitted 20 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: ACL 2024 Main

  27. arXiv:2404.00612  [pdf, other]

    cs.IT eess.SP

    Resource Allocation for Green Probabilistic Semantic Communication with Rate Splitting

    Authors: Ruopeng Xu, Zhaohui Yang, Zhouxiang Zhao, Qianqian Yang, Zhaoyang Zhang

    Abstract: In this paper, the energy efficient design for probabilistic semantic communication (PSC) system with rate splitting multiple access (RSMA) is investigated. Basic principles are first reviewed to show how the PSC system works to extract, compress and transmit the semantic information in a task-oriented transmission. Subsequently, the process of how multiuser semantic information can be represented…

    Submitted 31 March, 2024; originally announced April 2024.

  28. arXiv:2403.12400  [pdf, other]

    cs.LG cs.AI eess.SP

    Finding the Missing Data: A BERT-inspired Approach Against Package Loss in Wireless Sensing

    Authors: Zijian Zhao, Tingwei Chen, Fanyi Meng, Hang Li, Xiaoyang Li, Guangxu Zhu

    Abstract: Despite the development of various deep learning methods for Wi-Fi sensing, package loss often results in noncontinuous estimation of the Channel State Information (CSI), which negatively impacts the performance of the learning models. To overcome this challenge, we propose a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) for CSI recovery, named CSI-BER…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: 6 pages, accepted by IEEE INFOCOM Deepwireless Workshop 2024

  29. arXiv:2403.11780  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

    Authors: Yongqi Wang, Ruofan Hu, Rongjie Huang, Zhiqing Hong, Ruiqi Li, Wenrui Liu, Fuming You, Tao Jin, Zhou Zhao

    Abstract: Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Accepted by NAACL 2024 (main conference)

  30. arXiv:2403.11689  [pdf, other]

    eess.IV cs.CV

    MoreStyle: Relax Low-frequency Constraint of Fourier-based Image Reconstruction in Generalizable Medical Image Segmentation

    Authors: Haoyu Zhao, Wenhui Dong, Rui Yu, Zhou Zhao, Du Bo, Yongchao Xu

    Abstract: The task of single-source domain generalization (SDG) in medical image segmentation is crucial due to frequent domain shifts in clinical image datasets. To address the challenge of poor generalization across different domains, we introduce a Plug-and-Play module for data augmentation called MoreStyle. MoreStyle diversifies image styles by relaxing low-frequency constraints in Fourier space, guidin…

    Submitted 1 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: MICCAI2024

  31. arXiv:2403.11672  [pdf, other]

    eess.IV cs.CV

    WIA-LD2ND: Wavelet-based Image Alignment for Self-supervised Low-Dose CT Denoising

    Authors: Haoyu Zhao, Yuliang Gu, Zhou Zhao, Bo Du, Yongchao Xu, Rui Yu

    Abstract: In clinical examinations and diagnoses, low-dose computed tomography (LDCT) is crucial for minimizing health risks compared with normal-dose computed tomography (NDCT). However, reducing the radiation dose compromises the signal-to-noise ratio, leading to degraded quality of CT images. To address this, we analyze LDCT denoising task based on experimental results from the frequency perspective, and…

    Submitted 1 July, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: MICCAI2024

  32. arXiv:2403.11542  [pdf, ps, other]

    eess.SY

    Topology Data Analysis-based Error Detection for Semantic Image Transmission with Incremental Knowledge-based HARQ

    Authors: Fei Ni, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

    Abstract: Semantic communication (SemCom) aims to achieve high fidelity information delivery under low communication consumption by only guaranteeing semantic accuracy. Nevertheless, semantic communication still suffers from unexpected channel volatility and thus developing a re-transmission mechanism (e.g., hybrid automatic repeat request [HARQ]) is indispensable. In that regard, instead of discarding prev…

    Submitted 23 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

  33. arXiv:2403.09392  [pdf, other]

    eess.IV cs.CV

    Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation

    Authors: Yuliang Wu, Ganchao Tan, Jinze Chen, Wei Zhai, Yang Cao, Zheng-Jun Zha

    Abstract: Dynamic Range (DR) is a pivotal characteristic of imaging systems. Current frame-based cameras struggle to achieve high dynamic range imaging due to the conflict between globally uniform exposure and spatially variant scene illumination. In this paper, we propose AsynHDR, a Pixel-Asynchronous HDR imaging system, based on key insights into the challenges in HDR imaging and the unique event-generati…

    Submitted 14 March, 2024; originally announced March 2024.

  34. arXiv:2403.08504  [pdf, other]

    cs.CV cs.RO eess.IV

    OccFiner: Offboard Occupancy Refinement with Hybrid Propagation

    Authors: Hao Shi, Song Wang, Jiaming Zhang, Xiaoting Yin, Zhongdao Wang, Zhijian Zhao, Guangming Wang, Jianke Zhu, Kailun Yang, Kaiwei Wang

    Abstract: Vision-based occupancy prediction, also known as 3D Semantic Scene Completion (SSC), presents a significant challenge in computer vision. Previous methods, confined to onboard processing, struggle with simultaneous geometric and semantic estimation, continuity across varying viewpoints, and single-view occlusion. Our paper introduces OccFiner, a novel offboard framework designed to enhance the acc…

    Submitted 15 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  35. arXiv:2403.01132  [pdf]

    cs.LG cs.SD eess.AS

    MPIPN: A Multi Physics-Informed PointNet for solving parametric acoustic-structure systems

    Authors: Chu Wang, Jinhong Wu, Yanzhi Wang, Zhijian Zha, Qi Zhou

    Abstract: Machine learning is employed for solving physical systems governed by general nonlinear partial differential equations (PDEs). However, complex multi-physics systems such as acoustic-structure coupling are often described by a series of PDEs that incorporate variable physical quantities, which are referred to as parametric systems. There is a lack of strategies for solving parametric systems govern…

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: The number of figures is 16. The number of tables is 5. The number of words is 9717

  36. arXiv:2403.00434  [pdf, other]

    cs.IT eess.SP

    Probabilistic Semantic Communication over Wireless Networks with Rate Splitting

    Authors: Zhouxiang Zhao, Zhaohui Yang, Ye Hu, Qianqian Yang, Wei Xu, Zhaoyang Zhang

    Abstract: In this paper, the problem of joint transmission and computation resource allocation for probabilistic semantic communication (PSC) system with rate splitting multiple access (RSMA) is investigated. In the considered model, the base station (BS) needs to transmit a large amount of data to multiple users with RSMA. Due to limited communication resources, the BS is required to utilize semantic commu…

    Submitted 1 March, 2024; originally announced March 2024.

  37. arXiv:2402.16328  [pdf, other]

    cs.IT eess.SP

    A Joint Communication and Computation Design for Probabilistic Semantic Communications

    Authors: Zhouxiang Zhao, Zhaohui Yang, Mingzhe Chen, Zhaoyang Zhang, H. Vincent Poor

    Abstract: In this paper, the problem of joint transmission and computation resource allocation for a multi-user probabilistic semantic communication (PSC) network is investigated. In the considered model, users employ semantic information extraction techniques to compress their large-sized data before transmitting them to a multi-antenna base station (BS). Our model represents large-sized data through subst…

    Submitted 28 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  38. arXiv:2402.12208  [pdf, other]

    eess.AS cs.SD

    Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

    Authors: Shengpeng Ji, Minghui Fang, Ziyue Jiang, Siqi Zheng, Qian Chen, Rongjie Huang, Jialung Zuo, Shulei Wang, Zhou Zhao

    Abstract: In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codecs, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs a…

    Submitted 27 April, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: We release a more powerful checkpoint in Language-Codec v3

  39. arXiv:2402.09378  [pdf, other]

    eess.AS cs.SD

    MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

    Authors: Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

    Abstract: Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of unseen speaker voice prompts. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, mod…

    Submitted 2 June, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

    Comments: Accepted by ACL 2024 (Main Conference)

  40. arXiv:2402.07729  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

    Authors: Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou

    Abstract: Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded advancements in this field. Previous models primarily focus on assessing different fundamental tasks, such as Automatic Speech Recognition (ASR), and lack an assessment of the ope…

    Submitted 12 February, 2024; originally announced February 2024.

  41. arXiv:2401.05426  [pdf, other]

    eess.SP cs.AI cs.LG

    CoSS: Co-optimizing Sensor and Sampling Rate for Data-Efficient AI in Human Activity Recognition

    Authors: Mengxi Liu, Zimin Zhao, Daniel Geißler, Bo Zhou, Sungho Suh, Paul Lukowicz

    Abstract: Recent advancements in Artificial Neural Networks have significantly improved human activity recognition using multiple time-series sensors. While employing numerous sensors with high-frequency sampling rates usually improves the results, it often leads to data inefficiency and unnecessary expansion of the ANN, posing a challenge for their practical deployment on edge devices. Addressing these iss…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: Accepted by the 2nd Workshop on Sustainable AI (AAAI24)

  42. arXiv:2401.02117  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG eess.SY

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Authors: Zipeng Fu, Tony Z. Zhao, Chelsea Finn

    Abstract: Imitation learning from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks. In this work, we develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control. We first present Mobile ALOHA, a low-cost and whole-body…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Project website: https://mobile-aloha.github.io (Zipeng Fu and Tony Z. Zhao are project co-leads, Chelsea Finn is the advisor)

  43. arXiv:2312.17183  [pdf, other]

    eess.IV cs.CV

    One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts

    Authors: Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie

    Abstract: In this study, we focus on building up a model that aims to Segment Anything in medical scenarios, driven by Text prompts, termed as SAT. Our main contributions are threefold: (i) for dataset construction, we combine multiple knowledge sources to construct the first multi-modal knowledge tree on human anatomy, including 6502 anatomical terminologies; then we build up the largest and most compreh…

    Submitted 1 May, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: 53 pages

  44. arXiv:2312.15197  [pdf, other]

    cs.SD cs.CL cs.CV eess.AS

    TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

    Authors: Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Changpeng Yang, Zhou Zhao

    Abstract: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges comp…

    Submitted 23 December, 2023; originally announced December 2023.

  45. arXiv:2312.13556  [pdf, other]

    cs.SD eess.AS

    Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions

    Authors: Yang Liu, Haoqin Sun, Geng Chen, Qingyue Wang, Zhen Zhao, Xugang Lu, Longbiao Wang

    Abstract: Speech emotion recognition (SER) performance deteriorates significantly in the presence of noise, making it challenging to achieve competitive performance in noisy conditions. To this end, we propose a multi-level knowledge distillation (MLKD) method, which aims to transfer the knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically,…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted by INTERSPEECH 2023

  46. arXiv:2312.10741  [pdf, other]

    eess.AS cs.CL cs.SD

    StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

    Authors: Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

    Abstract: Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, the endeavor to model the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expr…

    Submitted 2 January, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024

  47. Robust Target Detection of Intelligent Integrated Optical Camera and mmWave Radar System

    Authors: Chen Zhu, Zhouxiang Zhao, Zejing Shan, Lijie Yang, Sijie Ji, Zhaohui Yang, Zhaoyang Zhang

    Abstract: Target detection is pivotal for modern urban computing applications. While image-based techniques are widely adopted, they falter under challenging environmental conditions such as adverse weather, poor lighting, and occlusion. To improve the target detection performance under complex real-world scenarios, this paper proposes an intelligent integrated optical camera and millimeter-wave (mmWave) ra…

    Submitted 12 December, 2023; originally announced December 2023.

  48. arXiv:2312.06197  [pdf, other]

    cs.SD cs.MM eess.AS

    MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer

    Authors: Dong Yao, Jieming Zhu, Jiahao Xun, Shengyu Zhang, Zhou Zhao, Liqun Deng, Wenqiao Zhang, Zhenhua Dong, Xin Jiang

    Abstract: Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up s…

    Submitted 19 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Short paper accepted by WWW 2024. This is revised and condensed based on the previous version titled "Music-PAW: Learning Music Representations via Hierarchical Part-whole Interaction and Contrast". For more experimental details and discussions, please refer to the original long paper at arXiv:2312.06197v1

  49. arXiv:2312.02471  [pdf, other]

    cs.NI cs.LG eess.SP

    Congestion-aware Distributed Task Offloading in Wireless Multi-hop Networks Using Graph Neural Networks

    Authors: Zhongyuan Zhao, Jake Perazzone, Gunjan Verma, Santiago Segarra

    Abstract: Computational offloading has become an enabling component for edge intelligence in mobile and smart devices. Existing offloading schemes mainly focus on mobile devices and servers, while ignoring the potential network congestion caused by tasks from multiple mobile devices, especially in wireless multi-hop networks. To fill this gap, we propose a low-overhead, congestion-aware distributed task off…

    Submitted 21 January, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: 5 pages, 5 figures, accepted to IEEE ICASSP 2024

    MSC Class: 05C90 ACM Class: C.2.1; C.2.2

  50. arXiv:2312.01423  [pdf, other]

    eess.SP

    Self-Critical Alternate Learning based Semantic Broadcast Communication

    Authors: Zhilin Lu, Rongpeng Li, Ming Lei, Chan Wang, Zhifeng Zhao, Honggang Zhang

    Abstract: Semantic communication (SemCom) has been deemed as a promising communication paradigm to break through the bottleneck of traditional communications. Nonetheless, most of the existing works focus more on point-to-point communication scenarios and its extension to multi-user scenarios is not that straightforward due to its cost-inefficiencies to directly scale the JSCC framework to the multi-user co…

    Submitted 3 December, 2023; originally announced December 2023.