Showing 1–50 of 168 results for author: Cao, Y

Searching in archive eess.

  1. arXiv:2406.16058  [pdf, other]

    eess.AS

    Text-Queried Target Sound Event Localization

    Authors: Jinzheng Zhao, Xinyuan Qian, Yong Xu, Haohe Liu, Yin Cao, Davide Berghi, Wenwu Wang

    Abstract: Sound event localization and detection (SELD) aims to determine the appearance of sound classes, together with their Direction of Arrival (DOA). However, current SELD systems can only predict the activities of specific classes, for example, 13 classes in DCASE challenges. In this paper, we propose text-queried target sound event localization (SEL), a new paradigm that allows the user to input the…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted by EUSIPCO 2024

  2. arXiv:2406.12268  [pdf, ps, other]

    eess.SP

    Channel Twinning: An Enabler for Next-Generation Ubiquitous Wireless Connectivity

    Authors: Yashuai Cao, Jingbo Tan, Jintao Wang, Wei Ni, Ekram Hossain, Dusit Niyato

    Abstract: The emerging concept of channel twinning (CT) has great potential to become a key enabler of ubiquitous connectivity in next-generation (xG) wireless systems. By fusing multimodal sensor data, CT advocates a high-fidelity and low-overhead channel acquisition paradigm, which is promising to provide accurate channel prediction in cross-domain and high-mobility scenarios of ubiquitous xG networks. Ho…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE

  3. arXiv:2406.09447  [pdf, ps, other]

    cs.IT eess.SP

    Self-Sustainable Active Reconfigurable Intelligent Surfaces for Anti-Jamming in Wireless Communications

    Authors: Yang Cao, Wenchi Cheng, Jingqing Wang, Wei Zhang

    Abstract: Wireless devices can be easily attacked by jammers during transmission, which is a potential security threat for wireless communications. Active reconfigurable intelligent surface (RIS) attracts considerable attention and is expected to be employed in anti-jamming systems for secure transmission to significantly enhance the anti-jamming performance. However, active RIS introduces external power lo…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE systems journal

  4. arXiv:2406.07807  [pdf, ps, other]

    cs.IT eess.SP

    Dynamic Energy-Saving Design for Double-Faced Active RIS Assisted Communications with Perfect/Imperfect CSI

    Authors: Yang Cao, Wenchi Cheng, Jingqing Wang, Wei Zhang

    Abstract: Although the emerging reconfigurable intelligent surface (RIS) paves a new way for next-generation wireless communications, it suffers from inherent flaws, i.e., double-fading attenuation effects and half-space coverage limitations. The state-of-the-art double-face active (DFA)-RIS architecture is proposed for significantly amplifying and transmitting incident signals in full-space. Despite the ef…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE TWC

  5. arXiv:2406.07255  [pdf, other]

    cs.CV eess.IV

    Towards Realistic Data Generation for Real-World Super-Resolution

    Authors: Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Xueyang Fu, Yang Wang, Yang Cao, Zheng-Jun Zha

    Abstract: Existing image super-resolution (SR) techniques often fail to generalize effectively in complex real-world settings due to the significant divergence between training data and practical scenarios. To address this challenge, previous efforts have either manually simulated intricate physical-based degradations or utilized learning-based techniques, yet these approaches remain inadequate for producin…

    Submitted 11 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

  6. arXiv:2406.02233  [pdf, other]

    eess.AS

    Towards Out-of-Distribution Detection in Vocoder Recognition via Latent Feature Reconstruction

    Authors: Renmingyue Du, Jixun Yao, Qiuqiang Kong, Yin Cao

    Abstract: Advancements in synthesized speech have created a growing threat of impersonation, making it crucial to develop deepfake algorithm recognition. One significant aspect is out-of-distribution (OOD) detection, which has gained notable attention due to its important role in deepfake algorithm recognition. However, most of the current approaches for detecting OOD in deepfake algorithm recognition rely…

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: 5 pages, 4 figures

  7. arXiv:2405.13549  [pdf, other]

    eess.SP cs.IT

    Multi-Objective Optimization-Based Waveform Design for Multi-User and Multi-Target MIMO-ISAC Systems

    Authors: Peng Wang, Dongsheng Han, Yashuai Cao, Wanli Ni, Dusit Niyato

    Abstract: Integrated sensing and communication (ISAC) opens up new service possibilities for sixth-generation (6G) systems, where both communication and sensing (C&S) functionalities co-exist by sharing the same hardware platform and radio resource. In this paper, we investigate the waveform design problem in a downlink multi-user and multi-target ISAC system under different C&S performance preferences. The…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 13 pages, submitted to IEEE TWC

  8. arXiv:2405.11263  [pdf, other]

    eess.SP

    MAMCA -- Optimal on Accuracy and Efficiency for Automatic Modulation Classification with Extended Signal Length

    Authors: Yezhuo Zhang, Zinan Zhou, Yichao Cao, Guangyu Li, Xuanpeng Li

    Abstract: With the rapid growth of the Internet of Things ecosystem, Automatic Modulation Classification (AMC) has become increasingly paramount. However, extended signal lengths offer a bounty of information, yet impede the model's adaptability, introduce more noise interference, extend the training and inference time, and increase storage overhead. To bridge the gap between these requisites, we propose a…

    Submitted 18 May, 2024; originally announced May 2024.

    Comments: 5 pages, 5 figures

  9. arXiv:2405.09470  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer

    Authors: Weifei Jin, Yuxin Cao, Junjie Su, Qi Shen, Kai Ye, Derui Wang, Jie Hao, Ziyao Liu

    Abstract: In light of the widespread application of Automatic Speech Recognition (ASR) systems, their security concerns have received much more attention than ever before, primarily due to the susceptibility of Deep Neural Networks. Previous studies have illustrated that surreptitiously crafting adversarial perturbations enables the manipulation of speech recognition systems, resulting in the production of…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted to SecTL (AsiaCCS Workshop) 2024

  10. arXiv:2405.07023  [pdf, other]

    eess.IV cs.CV

    Efficient Real-world Image Super-Resolution Via Adaptive Directional Gradient Convolution

    Authors: Long Peng, Yang Cao, Renjing Pei, Wenbo Li, Jiaming Guo, Xueyang Fu, Yang Wang, Zheng-Jun Zha

    Abstract: Real-SR endeavors to produce high-resolution images with rich details while mitigating the impact of multiple degradation factors. Although existing methods have achieved impressive achievements in detail recovery, they still fall short when addressing regions with complex gradient arrangements due to the intensity-based linear weighting feature extraction manner. Moreover, the stochastic artifact…

    Submitted 11 May, 2024; originally announced May 2024.

  11. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  12. arXiv:2404.16312  [pdf, other]

    eess.SY cs.MA cs.RO

    3D Guidance Law for Maximal Coverage and Target Enclosing with Inherent Safety

    Authors: Praveen Kumar Ranjan, Abhinav Sinha, Yongcan Cao

    Abstract: In this paper, we address the problem of enclosing an arbitrarily moving target in three dimensions by a single pursuer, which is an unmanned aerial vehicle (UAV), for maximum coverage while also ensuring the pursuer's safety by preventing collisions with the target. The proposed guidance strategy steers the pursuer to a safe region of space surrounding the target, allowing it to maintain a certai…

    Submitted 24 April, 2024; originally announced April 2024.

  13. arXiv:2404.14132  [pdf, other]

    cs.CV eess.IV

    CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task

    Authors: Kangzhen Yang, Tao Hu, Kexin Dai, Genggeng Chen, Yu Cao, Wei Dong, Peng Wu, Yanning Zhang, Qingsen Yan

    Abstract: In real-world scenarios, images captured often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. Howev…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: This paper is accepted by CVPR2024 Workshop, Code: https://github.com/CalvinYang0/CRNet

  14. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  15. arXiv:2404.04497  [pdf, other]

    eess.SY cs.MA cs.RO math.OC

    Self-organizing Multiagent Target Enclosing under Limited Information and Safety Guarantees

    Authors: Praveen Kumar Ranjan, Abhinav Sinha, Yongcan Cao

    Abstract: This paper introduces an approach to address the target enclosing problem using non-holonomic multiagent systems, where agents autonomously self-organize themselves in the desired formation around a fixed target. Our approach combines global enclosing behavior and local collision avoidance mechanisms by devising a novel potential function and sliding manifold. In our approach, agents independently…

    Submitted 6 April, 2024; originally announced April 2024.

  16. Linear Hybrid Asymmetrical Load-Modulated Balanced Amplifier with Multi-Band Reconfigurability and Antenna-VSWR Resilience

    Authors: Jiachen Guo, Yuchen Cao, Kenle Chen

    Abstract: This paper presents the first-ever highly linear and load-insensitive three-way load-modulation power amplifier (PA) based on reconfigurable hybrid asymmetrical load modulated balanced amplifier (H-ALMBA). Through proper amplitude and phase controls, the carrier, control amplifier (CA), and two peaking balanced amplifiers (BA1 and BA2) can form a linear high-order load modulation over wide bandwid…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  17. arXiv:2403.09527  [pdf, other]

    eess.AS

    WavCraft: Audio Editing and Generation with Large Language Models

    Authors: Jinhua Liang, Huan Zhang, Haohe Liu, Yin Cao, Qiuqiang Kong, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

    Abstract: We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompo…

    Submitted 10 May, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  18. arXiv:2403.09392  [pdf, other]

    eess.IV cs.CV

    Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation

    Authors: Yuliang Wu, Ganchao Tan, Jinze Chen, Wei Zhai, Yang Cao, Zheng-Jun Zha

    Abstract: Dynamic Range (DR) is a pivotal characteristic of imaging systems. Current frame-based cameras struggle to achieve high dynamic range imaging due to the conflict between globally uniform exposure and spatially variant scene illumination. In this paper, we propose AsynHDR, a Pixel-Asynchronous HDR imaging system, based on key insights into the challenges in HDR imaging and the unique event-generati…

    Submitted 14 March, 2024; originally announced March 2024.

  19. arXiv:2402.17259  [pdf, other]

    cs.SD eess.AS

    EDTC: enhance depth of text comprehension in automated audio captioning

    Authors: Liwen Tan, Yin Cao, Yi Zhou

    Abstract: Modality discrepancies have perpetually posed significant challenges within the realm of Automated Audio Captioning (AAC) and across all multi-modal domains. Facilitating models in comprehending text information plays a pivotal role in establishing a seamless connection between the two modalities of text and audio. While recent research has focused on closing the gap between these two modalities t…

    Submitted 27 February, 2024; originally announced February 2024.

  20. arXiv:2402.16453  [pdf, ps, other]

    eess.SP

    Intelligent Reflecting Surfaces and Next Generation Wireless Systems

    Authors: Yashuai Cao, Hetong Wang, Tiejun Lv, Wei Ni

    Abstract: Intelligent reflecting surface (IRS) is a potential candidate for massive multiple-input multiple-output (MIMO) 2.0 technology due to its low cost, ease of deployment, energy efficiency and extended coverage. This chapter investigates the slot-by-slot IRS reflection pattern design and two-timescale reflection pattern design schemes, respectively. For the slot-by-slot reflection optimization, we pr…

    Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: To appear as a chapter of the book "Massive MIMO for Future Wireless Communication Systems: Technology and Applications", to be published by Wiley-IEEE Press. arXiv admin note: text overlap with arXiv:2206.07276

  21. arXiv:2402.04865  [pdf, other]

    eess.SP

    Collaborative Computing in Non-Terrestrial Networks: A Multi-Time-Scale Deep Reinforcement Learning Approach

    Authors: Yang Cao, Shao-Yu Lien, Ying-Chang Liang, Dusit Niyato, Xuemin Shen

    Abstract: Constructing earth-fixed cells with low-earth orbit (LEO) satellites in non-terrestrial networks (NTNs) has been the most promising paradigm to enable global coverage. The limited computing capabilities on LEO satellites however render tackling resource optimization within a short duration a critical challenge. Although the sufficient computing capabilities of the ground infrastructures can be uti…

    Submitted 7 February, 2024; originally announced February 2024.

  22. arXiv:2402.04056  [pdf, other]

    eess.SP

    Collaborative Deep Reinforcement Learning for Resource Optimization in Non-Terrestrial Networks

    Authors: Yang Cao, Shao-Yu Lien, Ying-Chang Liang, Dusit Niyato, Xuemin Shen

    Abstract: Non-terrestrial networks (NTNs) with low-earth orbit (LEO) satellites have been regarded as promising remedies to support global ubiquitous wireless services. Due to the rapid mobility of LEO satellite, inter-beam/satellite handovers happen frequently for a specific user equipment (UE). To tackle this issue, earth-fixed cell scenarios have been under studied, in which the LEO satellite adjusts its…

    Submitted 6 February, 2024; originally announced February 2024.

  23. arXiv:2402.01828  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Retrieval Augmented End-to-End Spoken Dialog Models

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: We recently developed SLM, a joint speech and language model, which fuses a pretrained foundational speech model and a large language model (LLM), while preserving the in-context learning capability intrinsic to the pretrained LLM. In this paper, we apply SLM to speech dialog applications where the dialog states are inferred directly from the audio signal. Task-oriented dialogs often contain dom…

    Submitted 2 February, 2024; originally announced February 2024.

    Journal ref: Proc. ICASSP 2024

  24. arXiv:2401.07120  [pdf, other]

    cs.NI eess.SP quant-ph

    Generative AI-enabled Quantum Computing Networks and Intelligent Resource Allocation

    Authors: Minrui Xu, Dusit Niyato, Jiawen Kang, Zehui Xiong, Yuan Cao, Yulan Gao, Chao Ren, Han Yu

    Abstract: Quantum computing networks enable scalable collaboration and secure information exchange among multiple classical and quantum computing nodes while executing large-scale generative AI computation tasks and advanced quantum algorithms. Quantum computing networks overcome limitations such as the number of qubits and coherence time of entangled pairs and offer advantages for generative AI infrastruct…

    Submitted 13 January, 2024; originally announced January 2024.

  25. arXiv:2312.16422  [pdf, other]

    eess.AS cs.SD

    Selective-Memory Meta-Learning with Environment Representations for Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Qiuqiang Kong, Feiran Yang, Mark D. Plumbley, Jun Yang

    Abstract: Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a…

    Submitted 27 December, 2023; originally announced December 2023.

    Comments: 13 pages, 11 figures

  26. arXiv:2312.15628  [pdf, other]

    cs.SD eess.AS

    Balanced SNR-Aware Distillation for Guided Text-to-Audio Generation

    Authors: Bingzhi Liu, Yin Cao, Haohe Liu, Yi Zhou

    Abstract: Diffusion models have demonstrated promising results in text-to-audio generation tasks. However, their practical usability is hindered by slow sampling speeds, limiting their applicability in high-throughput scenarios. To address this challenge, progressive distillation methods have been effective in producing more compact and efficient models. Nevertheless, these methods encounter issues with unb…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: 5 pages

  27. arXiv:2312.15195  [pdf, other]

    cs.AI cs.LG eess.SY

    Mutual Information as Intrinsic Reward of Reinforcement Learning Agents for On-demand Ride Pooling

    Authors: Xianjie Zhang, Jiahao Sun, Chen Gong, Kai Wang, Yifei Cao, Hao Chen, Hao Chen, Yu Liu

    Abstract: The emergence of on-demand ride pooling services allows each vehicle to serve multiple passengers at a time, thus increasing drivers' income and enabling passengers to travel at lower prices than taxi/car on-demand services (only one passenger can be assigned to a car at a time like UberX and Lyft). Although on-demand ride pooling services can bring so many benefits, ride pooling services need a w…

    Submitted 7 January, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

    Comments: Accepted by AAMAS 2024

  28. arXiv:2312.11898  [pdf, other]

    cs.LG eess.SP

    Short-Term Multi-Horizon Line Loss Rate Forecasting of a Distribution Network Using Attention-GCN-LSTM

    Authors: Jie Liu, Yijia Cao, Yong Li, Yixiu Guo, Wei Deng

    Abstract: Accurately predicting line loss rates is vital for effective line loss management in distribution networks, especially over short-term multi-horizons ranging from one hour to one week. In this study, we propose Attention-GCN-LSTM, a novel method that combines Graph Convolutional Networks (GCN), Long Short-Term Memory (LSTM), and a three-level attention mechanism to address this challenge. By captu…

    Submitted 19 December, 2023; originally announced December 2023.

  29. arXiv:2311.10689  [pdf, other]

    eess.AS

    GhostVec: A New Threat to Speaker Privacy of End-to-End Speech Recognition System

    Authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He

    Abstract: Speaker adaptation systems face privacy concerns, for such systems are trained on private datasets and often overfitting. This paper demonstrates that an attacker can extract speaker information by querying speaker-adapted speech recognition (ASR) systems. We focus on the speaker information of a transformer-based ASR and propose GhostVec, a simple and efficient attack method to extract the speake…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: accepted in ACM Multimedia Asia 2023

  30. arXiv:2311.10664  [pdf, other]

    eess.AS

    Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

    Authors: Xiaojiao Chen, Sheng Li, Jiyi Li, Hao Huang, Yang Cao, Liang He

    Abstract: Current speaker anonymization methods, especially with self-supervised learning (SSL) models, require massive computational resources when hiding speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent End-to-End model reprogramming technology. To improve the anonymization performance, we first extract speaker representation from larg…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: accepted in ACM Multimedia Asia 2023

  31. arXiv:2311.04791  [pdf, other]

    eess.SP

    Integrated Distributed Semantic Communication and Over-the-air Computation for Cooperative Spectrum Sensing

    Authors: Peng Yi, Yang Cao, Xin Kang, Ying-Chang Liang

    Abstract: Cooperative spectrum sensing (CSS) is a promising approach to improve the detection of primary users (PUs) using multiple sensors. However, there are several challenges for existing combination methods, i.e., performance degradation and ceiling effect for hard-decision fusion (HDF), as well as significant uploading latency and non-robustness to noise in the reporting channel for soft-data fusion (…

    Submitted 25 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

    Comments: 13 pages, 10 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  32. arXiv:2310.19293  [pdf, other]

    eess.IV cs.CV

    FetusMapV2: Enhanced Fetal Pose Estimation in 3D Ultrasound

    Authors: Chaoyu Chen, Xin Yang, Yuhao Huang, Wenlong Shi, Yan Cao, Mingyuan Luo, Xindi Hu, Lei Zhue, Lequan Yu, Kejuan Yue, Yuanji Zhang, Yi Xiong, Dong Ni, Weijun Huang

    Abstract: Fetal pose estimation in 3D ultrasound (US) involves identifying a set of associated fetal anatomical landmarks. Its primary objective is to provide comprehensive information about the fetus through landmark connections, thus benefiting various critical applications, such as biometric measurements, plane localization, and fetal movement monitoring. However, accurately estimating the 3D fetal pose…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: 16 pages, 11 figures, accepted by Medical Image Analysis (2023)

  33. arXiv:2310.00230  [pdf, other]

    cs.CL cs.SD eess.AS

    SLM: Bridge the thin gap between speech and text foundation models

    Authors: Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

    Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achiev…

    Submitted 29 September, 2023; originally announced October 2023.

  34. arXiv:2309.16247  [pdf, other]

    eess.AS cs.SD

    PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

    Authors: Xiang Lyu, Yuhang Cao, Qing Wang, Jingjing Yin, Yuguang Yang, Pengpeng Zou, Yanni Hu, Heng Lu

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose PP-MeT system, a real-world personal…

    Submitted 28 September, 2023; originally announced September 2023.

  35. arXiv:2309.08377  [pdf, other]

    eess.AS cs.CL cs.SD

    DiaCorrect: Error Correction Back-end For Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

    Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  36. arXiv:2308.13790  [pdf, other]

    eess.IV cs.CV

    FFPN: Fourier Feature Pyramid Network for Ultrasound Image Segmentation

    Authors: Chaoyu Chen, Xin Yang, Rusi Chen, Junxuan Yu, Liwei Du, Jian Wang, Xindi Hu, Yan Cao, Yingying Liu, Dong Ni

    Abstract: Ultrasound (US) image segmentation is an active research area that requires real-time and highly accurate analysis in many scenarios. The detect-to-segment (DTS) frameworks have been recently proposed to balance accuracy and efficiency. However, existing approaches may suffer from inadequate contour encoding or fail to effectively leverage the encoded results. In this paper, we introduce a novel F…

    Submitted 26 August, 2023; originally announced August 2023.

    Comments: 10 pages, 5 figures, Accepted by MLMI 2023

  37. arXiv:2308.08847  [pdf, other]

    eess.AS cs.SD

    META-SELD: Meta-Learning for Fast Adaptation to the new environment in Sound Event Localization and Detection

    Authors: Jinbo Hu, Yin Cao, Ming Wu, Feiran Yang, Ziying Yu, Wenwu Wang, Mark D. Plumbley, Jun Yang

    Abstract: For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: Submitted to DCASE 2023 Workshop

  38. arXiv:2307.14603  [pdf, other]

    eess.IV cs.CV

    A Weakly Supervised Segmentation Network Embedding Cross-scale Attention Guidance and Noise-sensitive Constraint for Detecting Tertiary Lymphoid Structures of Pancreatic Tumors

    Authors: Bingxue Wang, Liwen Zou, Jun Chen, Yingying Cao, Zhenghua Cai, Yudong Qiu, Liang Mao, Zhongqiu Wang, Jingya Chen, Luying Gui, Xiaoping Yang

    Abstract: The presence of tertiary lymphoid structures (TLSs) on pancreatic pathological images is an important prognostic indicator of pancreatic tumors. Therefore, TLSs detection on pancreatic pathological images plays a crucial role in diagnosis and treatment for patients with pancreatic tumors. However, fully supervised detection algorithms based on deep learning usually require a large number of manual…

    Submitted 26 July, 2023; originally announced July 2023.

  39. arXiv:2307.14335  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    WavJourney: Compositional Audio Creation with Large Language Models

    Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang, Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang

    Abstract: Despite breakthroughs in audio generation models, their capabilities are often confined to domain-specific conditions such as speech transcriptions and audio captions. However, real-world audio creation aims to generate harmonious audio containing various elements such as speech, music, and sound effects with controllable conditions, which is challenging to address using existing audio generation…

    Submitted 26 November, 2023; v1 submitted 26 July, 2023; originally announced July 2023.

    Comments: GitHub: https://github.com/Audio-AGI/WavJourney

  40. arXiv:2307.10813  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    Perceptual Quality Assessment of Omnidirectional Audio-visual Signals

    Authors: Xilei Zhu, Huiyu Duan, Yuqin Cao, Yuxin Zhu, Yucheng Zhu, Jing Liu, Li Chen, Xiongkuo Min, Guangtao Zhai

    Abstract: Omnidirectional videos (ODVs) play an increasingly important role in the application fields of medical, education, advertising, tourism, etc. Assessing the quality of ODVs is significant for service-providers to improve the user's Quality of Experience (QoE). However, most existing quality assessment studies for ODVs only focus on the visual distortions of videos, while ignoring that the overall Q…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: 12 pages, 5 figures, to be published in CICAI2023

    ACM Class: I.4.0; I.5.4

  41. arXiv:2307.09729  [pdf, other

    cs.CV cs.MM eess.IV

    NTIRE 2023 Quality Assessment of Video Enhancement Challenge

    Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu, et al. (47 additional authors not shown)

    Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual…

    Submitted 18 July, 2023; originally announced July 2023.

  42. arXiv:2307.09714  [pdf]

    physics.optics eess.IV

    Flexible single multimode fiber imaging using white LED

    Authors: Minyu Fan, Kun Liu, Jie Zhu, Yu Cao, Sha Wang

    Abstract: Multimode fiber (MMF) has been proven to have good potential in imaging and optical communication because of its advantages of small diameter and large mode numbers. However, due to the mode coupling and modal dispersion, it is very sensitive to environmental changes. Minor changes in the fiber shape can lead to difficulties in information reconstruction. Here, white LED and cascaded Unet are used…

    Submitted 18 July, 2023; originally announced July 2023.

  43. arXiv:2307.04630  [pdf, other]

    cs.SD eess.AS

    The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

    Authors: Kun Song, Yi Lei, Peikun Chen, Yiqing Cao, Kun Wei, Yongmao Zhang, Lei Xie, Ning Jiang, Guoqing Zhao

    Abstract: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. The system is built in a cascaded manner consisting of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS). We make tremendous efforts to handle the challenging multi-source input. Spec…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: IWSLT@ACL 2023 system paper. Our submitted system ranks 1st in the S2ST task of the IWSLT 2023 evaluation campaign

  44. arXiv:2307.00828  [pdf, other]

    eess.SY cs.LG math.OC

    Model-Assisted Probabilistic Safe Adaptive Control With Meta-Bayesian Learning

    Authors: Shengbo Wang, Ke Li, Yin Yang, Yuting Cao, Tingwen Huang, Shiping Wen

    Abstract: Breaking safety constraints in control systems can lead to potential risks, resulting in unexpected costs or catastrophic damage. Nevertheless, uncertainty is ubiquitous, even among similar tasks. In this paper, we develop a novel adaptive safe control framework that integrates meta learning, Bayesian models, and control barrier function (CBF) method. Specifically, with the help of CBF method, we…

    Submitted 13 July, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

  45. arXiv:2306.09995  [pdf, other]

    cs.LG cs.AI cs.CY eess.SY

    Fairness in Preference-based Reinforcement Learning

    Authors: Umer Siddique, Abhinav Sinha, Yongcan Cao

    Abstract: In this paper, we address the issue of fairness in preference-based reinforcement learning (PbRL) in the presence of multiple objectives. The main objective is to design control policies that can optimize multiple objectives while treating each objective fairly. Toward this objective, we design a new fairness-induced preference-based reinforcement learning or FPbRL. The main idea of FPbRL is to le…

    Submitted 1 September, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted to The Many Facets of Preference Learning Workshop at the International Conference on Machine Learning (ICML)

  46. arXiv:2306.07944  [pdf, other]

    eess.AS cs.AI cs.CL

    Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

    Authors: Mingqiu Wang, Izhak Shafran, Hagen Soltau, Wei Han, Yuan Cao, Dian Yu, Laurent El Shafey

    Abstract: Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into text token embedding space without speech information loss. Additionally, using a CTC-based blank-filtering, w…

    Submitted 8 June, 2023; originally announced June 2023.

  47. arXiv:2306.01598  [pdf, other]

    cs.CV cs.RO eess.IV

    Towards Source-free Domain Adaptive Semantic Segmentation via Importance-aware and Prototype-contrast Learning

    Authors: Yihong Cao, Hui Zhang, Xiao Lu, Zheng Xiao, Kailun Yang, Yaonan Wang

    Abstract: Domain adaptive semantic segmentation enables robust pixel-wise understanding in real-world driving scenes. Source-free domain adaptation, as a more practical technique, addresses the concerns of data privacy and storage limitations in typical unsupervised domain adaptation methods, making it especially relevant in the context of intelligent vehicles. It utilizes a well-trained source model and un…

    Submitted 26 March, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). The source code is publicly available at https://github.com/yihong-97/Source-free-IAPC

  48. arXiv:2305.13929  [pdf]

    eess.SP cs.LG

    Deep Learning and Image Super-Resolution-Guided Beam and Power Allocation for mmWave Networks

    Authors: Yuwen Cao, Tomoaki Ohtsuki, Setareh Maghsudi, Tony Q. S. Quek

    Abstract: In this paper, we develop a deep learning (DL)-guided hybrid beam and power allocation approach for multiuser millimeter-wave (mmWave) networks, which facilitates swift beamforming at the base station (BS). The following persisting challenges motivated our research: (i) User and vehicular mobility, as well as redundant beam-reselections in mmWave networks, degrade the efficiency; (ii) Due to the l…

    Submitted 8 May, 2023; originally announced May 2023.

  49. arXiv:2304.14598  [pdf, other]

    cs.IT eess.SP

    A manifold learning-based CSI feedback framework for FDD massive MIMO

    Authors: Yandi Cao, Haifan Yin, Ziao Qin, Weidong Li, Weimin Wu, Merouane Debbah

    Abstract: Massive multi-input multi-output (MIMO) in Frequency Division Duplex (FDD) mode suffers from heavy feedback overhead for Channel State Information (CSI). In this paper, a novel manifold learning-based CSI feedback framework (MLCF) is proposed to reduce the feedback and improve the spectral efficiency of FDD massive MIMO. Manifold learning (ML) is an effective method for dimensionality reduction. H…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: 12 pages, 5 figures

  50. arXiv:2304.06451  [pdf, other]

    eess.SY

    A robust design of time-varying internal model principle-based control for ultra-precision tracking in a direct-drive servo stage

    Authors: Yue Cao, Zhen Zhang

    Abstract: This paper proposes a robust design of the time-varying internal model principle-based control (TV-IMPC) for tracking sophisticated references generated by linear time-varying (LTV) autonomous systems. The existing TV-IMPC design usually requires a complete knowledge of the plant I/O (input/output) model, leading to the lack of structural robustness. To tackle this issue, we, in the paper, design…

    Submitted 13 April, 2023; originally announced April 2023.