Showing 1–50 of 1,570 results for author: Wang, Y

Searching in archive eess.
  1. arXiv:2406.18985  [pdf, other]

    cs.IT eess.SP

    Exploiting Structured Sparsity in Near Field: From the Perspective of Decomposition

    Authors: Xufeng Guo, Yuanbin Chen, Ying Wang, Chau Yuen

    Abstract: The structured sparsity can be leveraged in traditional far-field channels, greatly facilitating efficient sparse channel recovery by compressing the complexity of overheads to the level of the scatterer number. However, when experiencing a fundamental shift from planar-wave-based far-field modeling to spherical-wave-based near-field modeling, whether these benefits persist in the near-field regim…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: This article has been accepted for publication in IEEE Commag

  2. arXiv:2406.18871  [pdf, other]

    eess.AS cs.CL

    DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

    Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

    Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  3. arXiv:2406.17225  [pdf, other]

    eess.IV cs.CV

    Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images

    Authors: Songhan Jiang, Zhengyu Gan, Linghan Cai, Yifeng Wang, Yongbing Zhang

    Abstract: Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive pixels contained in whole slide images (WSIs) complicate the processing of pathological images, making it difficult to generate an effective representation of the tu…

    Submitted 24 June, 2024; originally announced June 2024.

  4. arXiv:2406.16929  [pdf, other]

    eess.SP cs.AI

    Modelling the 5G Energy Consumption using Real-world Data: Energy Fingerprint is All You Need

    Authors: Tingwei Chen, Yantao Wang, Hanzhi Chen, Zijian Zhao, Xinhao Li, Nicola Piovesan, Guangxu Zhu, Qingjiang Shi

    Abstract: The introduction of fifth-generation (5G) radio technology has revolutionized communications, bringing unprecedented automation, capacity, connectivity, and ultra-fast, reliable communications. However, this technological leap comes with a substantial increase in energy consumption, presenting a significant challenge. To improve the energy efficiency of 5G networks, it is imperative to develop sop…

    Submitted 13 June, 2024; originally announced June 2024.

  5. arXiv:2406.16878  [pdf, ps, other]

    eess.SP cs.AI cs.IT

    Benchmarking Semantic Communications for Image Transmission Over MIMO Interference Channels

    Authors: Yanhu Wang, Shuaishuai Guo, Anming Dong, Hui Zhao

    Abstract: Semantic communications offer promising prospects for enhancing data transmission efficiency. However, existing schemes have predominantly concentrated on point-to-point transmissions. In this paper, we aim to investigate the validity of this claim in interference scenarios compared to baseline approaches. Specifically, our focus is on general multiple-input multiple-output (MIMO) interference cha…

    Submitted 10 April, 2024; originally announced June 2024.

  6. arXiv:2406.16381  [pdf, other]

    eess.SP

    Polar-Coded Tensor-Based Unsourced Random Access with Soft Decoding

    Authors: Jiaqi Fang, Yan Liang, Gangle Sun, Hongwei Hou, Yafei Wang, Li You, Wen** Wang

    Abstract: Unsourced random access (URA) has emerged as a viable scheme for supporting massive machine-type communications (mMTC) in sixth-generation (6G) wireless networks. Notably, the tensor-based URA (TURA), with its inherent tensor structure, stands out by simultaneously enhancing performance and reducing computational complexity for the multi-user separation, especially in mMTC networks wit…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  7. arXiv:2406.16303  [pdf, other]

    eess.SP

    Hybrid Precoding With Low-Resolution PSs for Wideband Terahertz Communication Systems in The Face of Beam Squint

    Authors: Yang Wang, Chuang Yang, Mugen Peng

    Abstract: Terahertz (THz) communication is considered one of the most critical technologies for 6G because of its abundant bandwidth. To compensate for the high propagation loss of THz, analog/digital hybrid precoding for THz massive multiple input multiple output (MIMO) is proposed to focus signals and extend communication range. Notably, considering hardware cost and power consumption, infinite and high-resolution…

    Submitted 24 June, 2024; originally announced June 2024.

  8. arXiv:2406.16150  [pdf, other]

    eess.IV cs.CV

    Intensity Confusion Matters: An Intensity-Distance Guided Loss for Bronchus Segmentation

    Authors: Haifan Gong, Wenhao Huang, Huan Zhang, Yu Wang, Xiang Wan, Hong Shen, Guanbin Li, Haofeng Li

    Abstract: Automatic segmentation of the bronchial tree from CT imaging is important, as it provides structural information for disease diagnosis. Despite the merits of previous automatic bronchus segmentation methods, they have paid less attention to the issue we term Intensity Confusion, wherein the intensity values of certain background voxels approach those of the foreground voxels within br…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: IEEE International Conference on Multimedia & Expo (ICME) 2024

  9. Learning Autonomous Race Driving with Action Mapping Reinforcement Learning

    Authors: Yuanda Wang, Xin Yuan, Changyin Sun

    Abstract: Autonomous race driving poses a complex control challenge as vehicles must be operated at the edge of their handling limits to reduce lap times while respecting physical and safety constraints. This paper presents a novel reinforcement learning (RL)-based approach, incorporating the action mapping (AM) mechanism to manage state-dependent input constraints arising from limited tire-road friction. A…

    Submitted 21 June, 2024; originally announced June 2024.

  10. arXiv:2406.14896  [pdf, other]

    eess.IV cs.CV

    SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation

    Authors: Wenhui Zhu, Xiwen Chen, Peijie Qiu, Mohammad Farazi, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

    Abstract: Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important facto…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted as a conference paper to 2024 MICCAI

  11. arXiv:2406.14069  [pdf, other]

    eess.IV cs.CV

    Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound

    Authors: Hong Wu, Juan Fu, Hongsheng Ye, Yuming Zhong, Xuebin Zou, Jianhua Zhou, Yi Wang

    Abstract: Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, the utilization of multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classifica…

    Submitted 20 June, 2024; originally announced June 2024.

  12. arXiv:2406.13977  [pdf, other]

    eess.IV cs.CV

    Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation Learning

    Authors: Tingyi Lin, Pengju Lyu, Jie Zhang, Yuqing Wang, Cheng Wang, Jianjun Zhu

    Abstract: Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional d…

    Submitted 19 June, 2024; originally announced June 2024.

  13. arXiv:2406.13340  [pdf, other]

    cs.CL cs.SD eess.AS

    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

    Authors: Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

    Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, includin…

    Submitted 19 June, 2024; originally announced June 2024.

  14. arXiv:2406.13275  [pdf, other]

    cs.SD cs.CL eess.AS

    Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

    Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

    Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)…

    Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  15. arXiv:2406.13179  [pdf, other]

    cs.SD cs.AI cs.NE eess.AS

    Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

    Authors: Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang

    Abstract: Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative…

    Submitted 18 June, 2024; originally announced June 2024.

  16. arXiv:2406.13150  [pdf]

    eess.IV cs.CV

    MCAD: Multi-modal Conditioned Adversarial Diffusion Model for High-Quality PET Image Reconstruction

    Authors: Jiaqi Cui, Xinyi Zeng, Pinxian Zeng, Bo Liu, Xi Wu, Jiliu Zhou, Yan Wang

    Abstract: Radiation hazards associated with standard-dose positron emission tomography (SPET) images remain a concern, whereas the quality of low-dose PET (LPET) images fails to meet clinical requirements. Therefore, there is great interest in reconstructing SPET images from LPET images. However, prior studies focus solely on image data, neglecting vital complementary information from other modalities, e.g.…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Early accepted by MICCAI2024

  17. arXiv:2406.12699  [pdf, other]

    cs.SD eess.AS eess.SP

    Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition

    Authors: Kuan-Chen Wang, You-** Li, Wei-Lun Chen, Yu-Wen Chen, Yi-Ching Wang, **-Cheng Yeh, Chao Zhang, Yu Tsao

    Abstract: Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study intro…

    Submitted 18 June, 2024; originally announced June 2024.

  18. arXiv:2406.11568  [pdf, other]

    cs.CL cs.SD eess.AS q-bio.NC

    Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

    Authors: Sheng Feng, Heyang Liu, Yu Wang, Yanfeng Wang

    Abstract: In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade m…

    Submitted 17 June, 2024; originally announced June 2024.

  19. arXiv:2406.10056  [pdf, other]

    cs.SD eess.AS

    UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner

    Authors: Dongchao Yang, Haohan Guo, Yuanyuan Wang, Rongjie Huang, Xiang Li, Xu Tan, Xixin Wu, Helen Meng

    Abstract: Large language models (LLMs) have demonstrated supreme capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, empowering the frozen LLMs to achieve multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel and LLMs-dr…

    Submitted 14 June, 2024; originally announced June 2024.

  20. arXiv:2406.09317  [pdf, other]

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Kai Yu, Aidi Lin, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Wei Chen, Yilong Luo, Yifan Chen, **gcheng Wang, Yih Chung Tham, Dianbo Liu, Wendy Wong, Sahil Thakur, Beau Fenner, Yanda Meng, Yukun Zhou, et al. (11 additional authors not shown)

    Abstract: The current retinal artificial intelligence models were trained using data with a limited category of diseases and limited knowledge. In this paper, we present a retinal vision-language foundation model (RetiZero) with knowledge of over 400 fundus diseases. Specifically, we collected 341,896 fundus images paired with text descriptions from 29 publicly available datasets, 180 ophthalmic books, and…

    Submitted 13 June, 2024; originally announced June 2024.

  21. arXiv:2406.09192  [pdf, other]

    eess.SP

    Joint Power Allocation and Beamforming Design for Active IRS-Aided Directional Modulation Secure Systems

    Authors: Yifan Zhao, Xiaoyu Wang, Kaibo Zhou, Xuehui Wang, Yan Wang, Wei Gao, Ruiqi Liu, Feng Shu

    Abstract: Since the secrecy rate (SR) performance improvement obtained by secure directional modulation (DM) network is limited, an active intelligent reflective surface (IRS)-assisted DM network is considered to attain a high SR. To address the SR maximization problem, a novel method based on Lagrangian dual transform and closed-form fractional programming algorithm (LDT-CFFP) is proposed, where the soluti…

    Submitted 25 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Directional modulation, active intelligent reflective surface, Lagrangian dual transformation, fractional programming, power allocation

  22. arXiv:2406.09182  [pdf, ps, other]

    eess.SP cs.LG

    Federated Contrastive Learning for Personalized Semantic Communication

    Authors: Yining Wang, Wanli Ni, Wenqiang Yi, Xiaodong Xu, ** Zhang, Arumugam Nallanathan

    Abstract: In this letter, we design a federated contrastive learning (FedCL) framework aimed at supporting personalized semantic communication. Our FedCL enables collaborative training of local semantic encoders across multiple clients and a global semantic decoder owned by the base station. This framework supports heterogeneous semantic encoders since it does not require client-side model aggregation. Furt…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: IEEE Communications Letters

  23. arXiv:2406.09061  [pdf, other]

    eess.SY

    Joint Observer Gain and Input Design for Asymptotic Active Fault Diagnosis

    Authors: Feng Xu, Yiming Wan, Ye Wang, Vicenc Puig

    Abstract: This paper proposes a joint gain and input design method for observer-based asymptotic active fault diagnosis, which is based on a newly-defined notion named the excluding degree of the origin from a zonotope. Using the excluding degree, a quantitative specification is obtained to characterize the performance of set-based robust fault diagnosis. Furthermore, a single gain design method and a joint…

    Submitted 13 June, 2024; originally announced June 2024.

  24. arXiv:2406.09022  [pdf, other]

    eess.SP

    Towards Unified AI Models for MU-MIMO Communications: A Tensor Equivariance Framework

    Authors: Yafei Wang, Hongwei Hou, ** Yi, Wen** Wang, Shi **

    Abstract: In this paper, we propose a unified framework based on equivariance for the design of artificial intelligence (AI)-assisted technologies in multi-user multiple-input-multiple-output (MU-MIMO) systems. We first provide definitions of multidimensional equivariance, high-order equivariance, and multidimensional invariance (referred to collectively as tensor equivariance). On this basis, by investigat…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  25. arXiv:2406.08716  [pdf, other]

    cs.SD eess.AS

    TSE-PI: Target Sound Extraction under Reverberant Environments with Pitch Information

    Authors: Yiwen Wang, Xihong Wu

    Abstract: Target sound extraction (TSE) separates the target sound from the mixture signals based on provided clues. However, the performance of existing models significantly degrades under reverberant conditions. Inspired by auditory scene analysis (ASA), this work proposes a TSE model provided with pitch information named TSE-PI. Conditional pitch extraction is achieved through the Feature-wise Linearly M…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech2024

  26. arXiv:2406.08634  [pdf, other]

    eess.IV cs.CV cs.LG

    Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning

    Authors: Zhongao Sun, Jiameng Li, Yuhan Wang, Jiarong Cheng, Qing Zhou, Chun Li

    Abstract: Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI) where missing modality images are common in clinical settings, leading to reduced segmentation accuracy. To address this issue, we propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data…

    Submitted 12 June, 2024; originally announced June 2024.

  27. arXiv:2406.08300  [pdf, other]

    eess.IV cs.CV

    From Chaos to Clarity: 3DGS in the Dark

    Authors: Zhihao Li, Yufei Wang, Alex Kot, Bihan Wen

    Abstract: Novel view synthesis from raw images provides superior high dynamic range (HDR) information compared to reconstructions from low dynamic range RGB images. However, the inherent noise in unprocessed raw images compromises the accuracy of 3D scene representation. Our study reveals that 3D Gaussian Splatting (3DGS) is particularly susceptible to this noise, leading to numerous elongated Gaussian shap…

    Submitted 12 June, 2024; originally announced June 2024.

  28. arXiv:2406.08203  [pdf, other]

    eess.AS cs.SD

    LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

    Authors: Wenhao Guan, Kaidi Wang, Wang** Zhou, Yang Wang, Feng Deng, Hui Wang, Lin Li, Qingyang Hong, Yong Qin

    Abstract: Recently, the application of diffusion models has facilitated the significant development of speech and audio generation. Nevertheless, the quality of samples generated by diffusion models still needs improvement. And the effectiveness of the method is accompanied by the extensive number of sampling steps, leading to an extended synthesis time necessary for generating high-quality audio. Previous…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech2024

  29. arXiv:2406.07914  [pdf, other]

    cs.SD eess.AS

    Can Large Language Models Understand Spatial Audio?

    Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and lo…

    Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  30. arXiv:2406.07842  [pdf, other]

    eess.AS cs.CL

    Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

    Authors: Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines: one for existing languages and another for new langua…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 4 tables

  31. arXiv:2406.07255  [pdf, other]

    cs.CV eess.IV

    Towards Realistic Data Generation for Real-World Super-Resolution

    Authors: Long Peng, Wenbo Li, Ren**g Pei, **g**g Ren, Xueyang Fu, Yang Wang, Yang Cao, Zheng-Jun Zha

    Abstract: Existing image super-resolution (SR) techniques often fail to generalize effectively in complex real-world settings due to the significant divergence between training data and practical scenarios. To address this challenge, previous efforts have either manually simulated intricate physical-based degradations or utilized learning-based techniques, yet these approaches remain inadequate for producin…

    Submitted 11 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

  32. arXiv:2406.07012  [pdf, other]

    cs.SD cs.CL eess.AS

    Bridging Language Gaps in Audio-Text Retrieval

    Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi…

    Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: interspeech2024

  33. arXiv:2406.06992  [pdf, other]

    cs.SD eess.AS

    Scaling up masked audio encoder learning for general audio classification

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and…

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  34. arXiv:2406.06833  [pdf, other]

    eess.SY stat.AP

    Data-driven Power Flow Linearization: Simulation

    Authors: Mengshuo Jia, Gabriela Hug, Ning Zhang, Zhaojian Wang, Yi Wang, Chongqing Kang

    Abstract: Building on the theoretical insights of Part I, this paper, as the second part of the tutorial, dives deeper into data-driven power flow linearization (DPFL), focusing on comprehensive numerical testing. The necessity of these simulations stems from the theoretical analysis's inherent limitations, particularly the challenge of identifying the differences in real-world performance among DPFL method…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 26 pages

  35. arXiv:2406.05915  [pdf, other]

    cs.CV eess.IV

    Bits-to-Photon: End-to-End Learned Scalable Point Cloud Compression for Direct Rendering

    Authors: Yueyu Hu, Ran Gong, Yao Wang

    Abstract: Point cloud is a promising 3D representation for volumetric streaming in emerging AR/VR applications. Despite recent advances in point cloud compression, decoding and rendering high-quality images from lossy compressed point clouds is still challenging in terms of quality and complexity, making it a major roadblock to achieve real-time 6-Degree-of-Freedom video streaming. In this paper, we address…

    Submitted 9 June, 2024; originally announced June 2024.

  36. arXiv:2406.05681  [pdf, other]

    cs.SD eess.AS

    Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

    Authors: Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang

    Abstract: Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timb…

    Submitted 11 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, accepted by Interspeech2024

  37. arXiv:2406.05515  [pdf, other]

    cs.SD cs.CL eess.AS

    Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

    Authors: Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim

    Abstract: Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  38. arXiv:2406.05305  [pdf, other]

    cs.CV cs.MM eess.IV

    YouTube SFV+HDR Quality Dataset

    Authors: Yilin Wang, Joong Gon Yim, Neil Birkbeck, Balu Adsumilli

    Abstract: The popularity of Short form videos (SFV) has grown dramatically in the past few years, and has become a phenomenal video category with billions of viewers. Meanwhile, High Dynamic Range (HDR) as an advanced feature also becomes more and more popular on video sharing platforms. As a hot topic with huge impact, SFV and HDR bring new questions to video quality research: 1) is SFV+HDR quality assessm…

    Submitted 20 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by 2024 IEEE International Conference on Image Processing. Dataset link: https://media.withyoutube.com/sfv-hdr

  39. arXiv:2406.04595  [pdf, other]

    cs.SD cs.CL eess.AS

    Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis

    Authors: Xintong Wang, Mingqian Shi, Ye Wang

    Abstract: Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) The two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage. 2) The scarcity of Mandarin MDD datasets limits model training. In this paper, we introduce a stateless RNN-T model for Mandarin MDD,…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  40. arXiv:2406.04105  [pdf, other]

    cs.LG eess.IV

    From Tissue Plane to Organ World: A Benchmark Dataset for Multimodal Biomedical Image Registration using Deep Co-Attention Networks

    Authors: Yifeng Wang, Weipeng Li, Thomas Pearce, Haohan Wang

    Abstract: Correlating neuropathology with neuroimaging findings provides a multiscale view of pathologic changes in the human organ spanning the meso- to micro-scales, and is an emerging methodology expected to shed light on numerous disease states. To gain the most information from this multimodal, multiscale approach, it is desirable to identify precisely where a histologic tissue section was taken from w…

    Submitted 6 June, 2024; originally announced June 2024.

  41. arXiv:2406.02859   

    eess.AS cs.SD

    ConPCO: Preserving Phoneme Characteristics for Automatic Pronunciation Assessment Leveraging Contrastive Ordinal Regularization

    Authors: Bi-Cheng Yan, Wei-Cheng Chao, Jiun-Ting Li, Yi-Cheng Wang, Hsin-Wei Wang, Meng-Shin Lin, Berlin Chen

    Abstract: Automatic pronunciation assessment (APA) manages to evaluate the pronunciation proficiency of a second language (L2) learner in a target language. Existing efforts typically draw on regression models for proficiency score prediction, where the models are trained to estimate target values without explicitly accounting for phoneme-awareness in the feature space. In this paper, we propose a contrasti…

    Submitted 8 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: This paper has been withdrawn because the authors aim to achieve better organization in writing and more detailed experimental analysis

  42. arXiv:2406.02557  [pdf, other]

    eess.IV cs.AI cs.CV cs.MM

    EVAN: Evolutional Video Streaming Adaptation via Neural Representation

    Authors: Mufan Liu, Le Yang, Yiling Xu, Ye-kui Wang, Jenq-Neng Hwang

    Abstract: Adaptive bitrate (ABR) using conventional codecs cannot further modify the bitrate once a decision has been made, exhibiting limited adaptation capability. This may result in either overly conservative or overly aggressive bitrate selection, which could cause either inefficient utilization of the network bandwidth or frequent re-buffering, respectively. Neural representation for video (NeRV), whic…

    Submitted 15 April, 2024; originally announced June 2024.

    Comments: Accepted by ICME (conference)

  43. arXiv:2406.02430  [pdf, other]

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub…

    Submitted 4 June, 2024; originally announced June 2024.

  44. arXiv:2406.02429  [pdf, other]

    eess.AS cs.SD

    Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion

    Authors: Ruiqi Li, Rongjie Huang, Yongqi Wang, Zhiqing Hong, Zhou Zhao

    Abstract: Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training mod…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 13 pages

  45. arXiv:2406.02164  [pdf, other]

    cs.IT eess.SP

    Sparse Recovery for Holographic MIMO Channels: Leveraging the Clustered Sparsity

    Authors: Yuqing Guo, Xufeng Guo, Yuanbin Chen, Ying Wang

    Abstract: Envisioned as the next-generation transceiver technology, the holographic multiple-input-multiple-output (HMIMO) garners attention for its superior capabilities of fabricating electromagnetic (EM) waves. However, the densely packed antenna elements significantly increase the dimension of the HMIMO channel matrix, rendering traditional channel estimation methods inefficient. While the dimension cur…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: This manuscript has been submitted to IEEE journal, 5 pages, 3 figures

  46. arXiv:2406.02126  [pdf, other]

    eess.SY cs.AI cs.LG cs.MA

    CityLight: A Universal Model Towards Real-world City-scale Traffic Signal Control Coordination

    Authors: Jinwei Zeng, Chao Yu, Xinyi Yang, Wenxuan Ao, Jian Yuan, Yong Li, Yu Wang, Huazhong Yang

    Abstract: Traffic signal control (TSC) is a promising low-cost measure to enhance transportation efficiency without affecting existing road infrastructure. While various reinforcement learning-based TSC methods have been proposed and experimentally outperform conventional rule-based methods, none of them has been deployed in the real world. An essential gap lies in the oversimplification of the scenarios in…

    Submitted 6 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  47. arXiv:2406.02014  [pdf, other]

    q-bio.NC cs.LG cs.SD eess.AS

    Understanding Auditory Evoked Brain Signal via Physics-informed Embedding Network with Multi-Task Transformer

    Authors: Wanli Ma, Xuegang Tang, Jin Gu, Ying Wang, Yuling Xia

    Abstract: In the fields of brain-computer interaction and cognitive neuroscience, effective decoding of auditory signals from task-based functional magnetic resonance imaging (fMRI) is key to understanding how the brain processes complex auditory information. Although existing methods have enhanced decoding capabilities, limitations remain in information utilization and model representation. To overcome the…

    Submitted 4 June, 2024; originally announced June 2024.

  48. arXiv:2406.01937  [pdf, other]

    cs.IT eess.SP

    Cramér-Rao Bound Analysis and Beamforming Design for Integrated Sensing and Communication with Extended Targets

    Authors: Yiqiu Wang, Meixia Tao, Shu Sun

    Abstract: This paper studies an integrated sensing and communication (ISAC) system, where a multi-antenna base station transmits beamformed signals for joint downlink multi-user communication and radar sensing of an extended target (ET). By considering echo signals as reflections from valid elements on the ET contour, a set of novel Cramér-Rao bounds (CRBs) is derived for parameter estimation of the ET, inc…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Submitted to IEEE Transactions on Wireless Communications. arXiv admin note: text overlap with arXiv:2312.10641

  49. arXiv:2406.01644  [pdf, other]

    eess.IV

    Dual-Stream Attention Network for Hyperspectral Image Unmixing

    Authors: Yufang Wang, Wenmin Wu, Lin Qi, Feng Gao

    Abstract: Hyperspectral image (HSI) contains abundant spatial and spectral information, making it highly valuable for unmixing. In this paper, we propose a Dual-Stream Attention Network (DSANet) for HSI unmixing. The endmembers and abundance of a pixel in HSI have high correlations with its adjacent pixels. Therefore, we adopt a "many to one" strategy to estimate the abundance of the central pixel. In addit…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted by IEEE IGARSS 2024

  50. arXiv:2406.00689  [pdf, other]

    cs.IT eess.SP

    Hybrid Beamforming Design for Integrated Sensing and Communication Exploiting Prior Information

    Authors: Yizhuo Wang, Shuowen Zhang

    Abstract: In this paper, we investigate the hybrid beamforming design for a multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system, where a multi-antenna base station (BS) with hybrid analog-digital transmit antenna arrays sends dual-functional signals to communicate with a multi-antenna user and simultaneously sense the location information of a point target based on the r…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Submitted for possible conference publication