Showing 1–50 of 101 results for author: Zou, Y

Searching in archive eess.
  1. arXiv:2405.04867  [pdf, other]

    eess.IV cs.CV

    MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

    Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhi**g Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Hai** Zeng, Kai Feng, et al. (24 additional authors not shown)

    Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photogra…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

  2. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  3. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  4. arXiv:2401.11141  [pdf, other]

    cs.IT eess.SP

    Wideband Beamforming for RIS Assisted Near-Field Communications

    Authors: Ji Wang, Jian Xiao, Yixuan Zou, Wenwu Xie, Yuanwei Liu

    Abstract: A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid…

    Submitted 20 January, 2024; originally announced January 2024.

  5. arXiv:2312.09760  [pdf, other]

    eess.AS cs.SD

    U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias

    Authors: Ao Zhang, Pan Zhou, Kaixun Huang, Yong Zou, Ming Liu, Lei Xie

    Abstract: Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasingly more interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR training criteria to model all phonemes, making the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabu…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted by ASRU2023

  6. arXiv:2311.12770  [pdf, other]

    eess.IV cs.CV

    Swift Parameter-free Attention Network for Efficient Super-Resolution

    Authors: Cheng Wan, Hongyuan Yu, Zhiqi Li, Yihang Chen, Yajun Zou, Yuqing Liu, Xuanwu Yin, Kunlong Zuo

    Abstract: Single Image Super-Resolution (SISR) is a crucial task in low-level computer vision, aiming to reconstruct high-resolution images from low-resolution counterparts. Conventional attention mechanisms have significantly improved SISR performance but often result in complex network structures and a large number of parameters, leading to slow inference speed and large model size. To address this issue, w…

    Submitted 12 May, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: NTIRE2024 ESR winner

  7. arXiv:2310.15011  [pdf, ps, other]

    cs.IT eess.SP

    Interference Management by Harnessing Multi-Domain Resources in Spectrum-Sharing Aided Satellite-Ground Integrated Networks

    Authors: Xiaojin Ding, Yue Lei, Yulong Zou, Gengxin Zhang, Lajos Hanzo

    Abstract: A spectrum-sharing satellite-ground integrated network is conceived, consisting of a pair of non-geostationary orbit (NGSO) constellations and multiple terrestrial base stations, which impose the co-frequency interference (CFI) on each other. The CFI may increase upon increasing the number of satellites. To manage the potentially severe interference, we propose to rely on joint multi-domain resour…

    Submitted 29 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Submitted to IEEE Transactions on Vehicular Technology

  8. arXiv:2309.11477  [pdf, other]

    eess.SY

    Multi-Agent Robust Control Synthesis from Global Temporal Logic Tasks

    Authors: Tiange Yang, Yuanyuan Zou, Jinfeng Liu, Tianyu Jia, Shaoyuan Li

    Abstract: This paper focuses on the heterogeneous multi-agent control problem under global temporal logic tasks. We define a specification language, called extended capacity temporal logic (ECaTL), to describe the required global tasks, including the number of times that a local or coupled signal temporal logic (STL) task needs to be satisfied and the synchronous requirements on task satisfaction. The robus…

    Submitted 17 November, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

    Comments: 6 pages, 3 figures

  9. arXiv:2309.02020  [pdf, other]

    eess.IV cs.CV

    RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image

    Authors: Yunhao Zou, Chenggang Yan, Ying Fu

    Abstract: High dynamic range (HDR) images capture many more intensity levels than standard ones. Current methods predominantly generate HDR images from 8-bit low dynamic range (LDR) sRGB images that have been degraded by the camera processing pipeline. However, it becomes a formidable task to retrieve extremely high dynamic range scenes from such limited bit-depth data. Unlike existing methods, the core ide…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  10. arXiv:2309.01212  [pdf, other]

    cs.SD eess.AS

    NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement

    Authors: Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou

    Abstract: The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for the DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-condition…

    Submitted 3 September, 2023; originally announced September 2023.

  11. arXiv:2308.06732  [pdf, ps, other]

    cs.NI eess.SY

    UD-MAC: Delay Tolerant Multiple Access Control Protocol for Unmanned Aerial Vehicle Networks

    Authors: Yingying Zou, Zhiqing Wei, Yanpeng Cui, Xinyi Liu, Zhiyong Feng

    Abstract: In unmanned aerial vehicle (UAV) networks, high-capacity data transmission is of utmost importance for applications such as intelligent transportation, smart cities, and forest monitoring, which rely on the mobility of UAVs to collect and transmit large amounts of data, including video and image data. Due to the short flight time of UAVs, the network capacity will be reduced when they return to the…

    Submitted 13 August, 2023; originally announced August 2023.

  12. arXiv:2308.00187  [pdf, ps, other]

    cs.RO cs.CV eess.SP

    Detecting the Anomalies in LiDAR Pointcloud

    Authors: Chiyu Zhang, Ji Han, Yao Zou, Kexin Dong, Yujia Li, Junchun Ding, Xiaoling Han

    Abstract: LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as occasional LiDAR hardware faults, may cause the LiDAR to produce pointclouds with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR…

    Submitted 31 July, 2023; originally announced August 2023.

  13. arXiv:2307.15344  [pdf, other]

    cs.SD eess.AS

    Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

    Authors: Yifei Xin, Yuexian Zou

    Abstract: Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phras…

    Submitted 28 July, 2023; originally announced July 2023.

    Comments: Accepted by Interspeech2023

  14. arXiv:2307.05362  [pdf, other]

    eess.SP cs.LG

    SleepEGAN: A GAN-enhanced Ensemble Deep Learning Model for Imbalanced Classification of Sleep Stages

    Authors: Xuewei Cheng, Ke Huang, Yi Zou, Shujie Ma

    Abstract: Deep neural networks have played an important role in automatic sleep stage classification because of their strong representation and in-model feature transformation abilities. However, class imbalance and individual heterogeneity, which typically exist in raw EEG signals of sleep data, can significantly affect the classification performance of any machine learning algorithm. To solve these two pro…

    Submitted 3 July, 2023; originally announced July 2023.

    Comments: 20 pages, 6 figures

  15. arXiv:2307.00861  [pdf, other]

    cs.RO eess.SY

    Perch a quadrotor on planes by the ceiling effect

    Authors: Yuying Zou, Haotian Li, Yunfan Ren, Wei Xu, Yihang Li, Yixi Cai, Shenji Zhou, Fu Zhang

    Abstract: Perching is a promising solution for a small unmanned aerial vehicle (UAV) to save energy and extend operation time. This paper proposes a quadrotor that can perch on planar structures using the ceiling effect. Compared with the existing work, this perching method does not require any claws, hooks, or adhesive pads, leading to a simpler system design. This method does not limit the perching by sur…

    Submitted 3 July, 2023; originally announced July 2023.

  16. arXiv:2306.05196  [pdf, other]

    eess.IV cs.CV

    Channel prior convolutional attention for medical image segmentation

    Authors: Hejun Huang, Zuguo Chen, Ying Zou, Ming Lu, Chaoyang Chen

    Abstract: Characteristics such as low contrast and significant organ shape variations are often exhibited in medical images. The improvement of segmentation performance in medical imaging is limited by the generally insufficient adaptive capabilities of existing attention mechanisms. An efficient Channel Prior Convolutional Attention (CPCA) method is proposed in this paper, supporting the dynamic distributi…

    Submitted 8 June, 2023; originally announced June 2023.

  17. arXiv:2305.02765  [pdf, other]

    cs.SD eess.AS

    HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec

    Authors: Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou

    Abstract: Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encode…

    Submitted 7 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: The second version of HiFi-Codec

  18. arXiv:2303.17395  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Authors: Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D. Plumbley, Yuexian Zou, Wenwu Wang

    Abstract: The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approx…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: 12 pages

  19. arXiv:2303.08671  [pdf, other]

    cs.NI eess.SY

    A Dual-Cluster-Head Based Medium Access Control for Large-Scale UAV Ad-Hoc Networks

    Authors: Xinru Zhao, Zhiqing Wei, Yingying Zou, Hao Ma, Yanpeng Cui, Zhiyong Feng

    Abstract: Unmanned aerial vehicle (UAV) ad hoc networks have achieved significant growth in recent years owing to their flexibility, extensibility, and high deployability. Applying a clustering scheme to UAV ad hoc networks is imperative to enhance throughput and energy efficiency. In conventional clustering schemes, a single cluster head (CH) is always assigned to each cluster. However, th…

    Submitted 26 February, 2023; originally announced March 2023.

    Comments: 10 pages, 12 figures, journal

  20. arXiv:2303.05681  [pdf, other]

    cs.SD eess.AS

    Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

    Authors: Yifei Xin, Dongchao Yang, Yuexian Zou

    Abstract: In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and audio, the semantic information contained in the text is only similar to certain frames within the audio. Yet, existing works aggregate the entire audio without considering the text, such as mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text.…

    Submitted 30 March, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

  21. arXiv:2303.05678  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Weakly Supervised Sound Event Detection with Causal Intervention

    Authors: Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

    Abstract: Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision. To tackle this issue, we fir…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  22. arXiv:2303.01086  [pdf, other]

    cs.CL cs.SD eess.AS

    LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion

    Authors: Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma

    Abstract: As a key component of automated speech recognition (ASR) and the front-end in text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations. Existing methods are either slow or poor in performance, and are limited in application scenarios, particularly in the process of on-device inference. In this paper, we integrate the advantages of b…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP2023

  23. arXiv:2302.09328  [pdf, other]

    cs.MM cs.SD eess.AS

    SSVMR: Saliency-based Self-training for Video-Music Retrieval

    Authors: Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou

    Abstract: With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and music in the feature space. However, they (1) neglect th…

    Submitted 18 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  24. arXiv:2301.01369  [pdf, other]

    eess.IV cs.CV cs.LG

    Brain Tissue Segmentation Across the Human Lifespan via Supervised Contrastive Learning

    Authors: Xiaoyang Chen, **jian Wu, Wenjiao Lyu, Yicheng Zou, Kim-Han Thung, Siyuan Liu, Ye Wu, Sahar Ahmad, Pew-Thian Yap

    Abstract: Automatic segmentation of brain MR images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is critical for tissue volumetric analysis and cortical surface reconstruction. Due to dramatic structural and appearance changes associated with developmental and aging processes, existing brain tissue segmentation methods are only viable for specific age groups. Consequently, methods…

    Submitted 3 January, 2023; originally announced January 2023.

  25. arXiv:2212.08348  [pdf, other]

    cs.SD eess.AS

    Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation

    Authors: Rongzhi Gu, Shi-Xiong Zhang, Yuexian Zou, Dong Yu

    Abstract: Recently, frequency domain all-neural beamforming methods have achieved remarkable progress for multichannel speech separation. In parallel, the integration of time domain network structure and beamforming also gains significant attention. This study proposes a novel all-neural beamforming method in time domain and makes an attempt to unify the all-neural beamforming pipelines for time domain and…

    Submitted 23 December, 2022; v1 submitted 16 December, 2022; originally announced December 2022.

  26. arXiv:2212.03657  [pdf, other]

    cs.CL cs.SD eess.AS

    M3ST: Mix at Three Levels for Speech Translation

    Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

    Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose the Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine…

    Submitted 7 December, 2022; originally announced December 2022.

    Comments: Submitted to ICASSP 2023

  27. arXiv:2212.01839  [pdf, other]

    eess.SP

    Proximal Gradient-Based Unfolding for Massive Random Access in IoT Networks

    Authors: Yinan Zou, Yong Zhou, Xu Chen, Yonina C. Eldar

    Abstract: Grant-free random access is an effective technology for enabling low-overhead and low-latency massive access, where joint activity detection and channel estimation (JADCE) is a critical issue. Although existing compressive sensing algorithms can be applied for JADCE, they usually fail to simultaneously harvest the following properties: effective sparsity inducing, fast convergence, robust to diffe…

    Submitted 4 December, 2022; originally announced December 2022.

  28. arXiv:2211.02448  [pdf, other]

    cs.SD eess.AS

    NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

    Authors: Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, Yuexian Zou

    Abstract: Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In t…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP2023

  29. arXiv:2210.02725  [pdf, ps, other]

    cs.IT eess.SP

    Exploiting NOMA and RIS in Integrated Sensing and Communication

    Authors: Jiakuo Zuo, Yuanwei Liu, Chenming Zhu, Yixuan Zou, Dengyin Zhang, Naofal Al-Dhahir

    Abstract: A novel integrated sensing and communication (ISAC) system is proposed, where a dual-functional base station is utilized to transmit the superimposed non-orthogonal multiple access (NOMA) communication signal for serving communication users and sensing targets simultaneously. Furthermore, a new reconfigurable intelligent surface (RIS)-aided-sensing structure is also proposed to address the signifi…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2208.04786

  30. arXiv:2210.01438  [pdf, other]

    eess.IV cs.CV

    Complementary consistency semi-supervised learning for 3D left atrial image segmentation

    Authors: Hejun Huang, Zuguo Chen, Chaoyang Chen, Ming Lu, Ying Zou

    Abstract: A network based on complementary consistency training, called CC-Net, has been proposed for semi-supervised left atrium image segmentation. CC-Net efficiently utilizes unlabeled data from the perspective of complementary information to address the problem of limited ability of existing semi-supervised segmentation algorithms to extract information from unlabeled data. The complementary symmetric s…

    Submitted 4 April, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

  31. arXiv:2209.12666  [pdf, other]

    eess.SY

    Anti-Delay Kalman Filter Fusion Algorithm for Vehicle-borne Sensor Network with Finite-Time Convergence

    Authors: Hang Yu, Keren Dai, Haojie Li, Yao Zou, Xiang Ma, Shaojie Ma, He Zhang

    Abstract: For intelligent vehicles in autonomous driving and obstacle avoidance, there is a higher demand for the precise relative state of vehicles. For a vehicle-borne sensor network with time-varying transmission delays, the problem of coordinate fusion of vehicle state is the focus of this paper. By the ingeniously designed low-complexity integration with a consensus strategy and buffer technology, an anti-d…

    Submitted 19 September, 2022; originally announced September 2022.

  32. arXiv:2207.12793  [pdf]

    eess.SY

    Modeling mandatory and discretionary lane changes using dynamic interaction networks

    Authors: Yue Zhang, Yajie Zou, Yuanchang Xie, Lei Chen

    Abstract: A quantitative understanding of dynamic lane-changing (LC) interaction patterns is indispensable for improving the decision-making of autonomous vehicles, especially in mixed traffic with human-driven vehicles. This paper develops a novel framework combining the hidden Markov model and graph structure to identify the difference in dynamic interaction networks between mandatory lane changes (MLC) a…

    Submitted 26 July, 2022; originally announced July 2022.

  33. arXiv:2207.09983  [pdf, other]

    cs.SD cs.AI eess.AS

    Diffsound: Discrete Diffusion Model for Text-to-sound Generation

    Authors: Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses t…

    Submitted 28 April, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted by TASLP2022

  34. arXiv:2207.00475  [pdf, other]

    cs.CV cs.AI cs.LG eess.IV

    Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound

    Authors: Yuxin Zou, Haoran Dou, Yuhao Huang, Xin Yang, Jikuan Qian, Chaojiong Zhen, Xiaodan Ji, Nishant Ravikumar, Guoqiang Chen, Weijun Huang, Alejandro F. Frangi, Dong Ni

    Abstract: Standard plane (SP) localization is essential in routine clinical ultrasound (US) diagnosis. Compared to 2D US, 3D US can acquire multiple view planes in one scan and provide complete anatomy with the addition of the coronal plane. However, manually navigating SPs in 3D US is laborious and biased due to the orientation variability and huge search space. In this study, we introduce a novel reinforcemen…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted by MICCAI 2022

  35. arXiv:2206.12281  [pdf]

    eess.SP

    Real-time Dual-channel 2 * 2 MIMO Fiber-THz-Fiber Seamless Integration System at 385 GHz and 435 GHz

    Authors: Jiao Zhang, Min Zhu, Bingchang Hua, Mingzheng Lei, Yuancheng Cai, Liang Tian, Yucong Zou, Like Ma, Yongming Huang, Jianjun Yu, Xiaohu You

    Abstract: We demonstrate the first practical real-time dual-channel fiber-THz-fiber 2 * 2 MIMO seamless integration system with a record net data rate of 2 * 103.125 Gb/s at 385 GHz and 435 GHz over two spans of 20 km SSMF and 3 m wireless link.

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: This paper has been accepted by ECOC 2022

  36. arXiv:2205.03242  [pdf]

    eess.SP cs.AI cs.LG

    Electrocardiographic Deep Learning for Predicting Post-Procedural Mortality

    Authors: David Ouyang, John Theurer, Nathan R. Stein, J. Weston Hughes, Pierre Elias, Bryan He, Neal Yuan, Grant Duffy, Roopinder K. Sandhu, Joseph Ebinger, Patrick Botting, Melvin Jujjavarapu, Brian Claggett, James E. Tooley, Tim Poterucha, Jonathan H. Chen, Michael Nurok, Marco Perez, Adler Perotte, James Y. Zou, Nancy R. Cook, Sumeet S. Chugh, Susan Cheng, Christine M. Albert

    Abstract: Background. Pre-operative risk assessments used in clinical practice are limited in their ability to identify risk for post-operative mortality. We hypothesize that electrocardiograms contain hidden risk markers that can help prognosticate post-operative mortality. Methods. In a derivation cohort of 45,969 pre-operative patients (age 59 ± 19 years, 55 percent women), a deep learning algorithm was…

    Submitted 30 April, 2022; originally announced May 2022.

  37. Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

    Authors: Xinmeng Xu, Rongzhi Gu, Yuexian Zou

    Abstract: Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in the end-to-end DMSE. In this work, a novel architecture for…

    Submitted 2 May, 2022; originally announced May 2022.

    Comments: Accepted by ICASSP 2022

  38. arXiv:2204.14272  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

    Authors: Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, Yuexian Zou

    Abstract: In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows given the speech…

    Submitted 29 April, 2022; originally announced April 2022.

    Comments: In Findings of NAACL 2022. arXiv admin note: substantial text overlap with arXiv:2010.08923

  39. arXiv:2204.07375  [pdf, other]

    eess.AS cs.SD

    Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

    Authors: Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

    Abstract: Dominant research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker…

    Submitted 15 April, 2022; originally announced April 2022.

    Comments: 5 pages, 4 tables, 4 figures. Submitted to INTERSPEECH 2022

  40. arXiv:2204.06455  [pdf, other]

    eess.IV cs.CV

    WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung Adenocarcinoma

    Authors: Chu Han, Xipeng Pan, Lixu Yan, Huan Lin, Bingbing Li, Su Yao, Shanshan Lv, Zhenwei Shi, **hai Mai, Jiatai Lin, Bingchao Zhao, Zeyan Xu, Zhizhen Wang, Yumeng Wang, Yuan Zhang, Huihui Wang, Chao Zhu, Chunhui Lin, Lijian Mao, Min Wu, Luwen Duan, **gsong Zhu, Dong Hu, Zijie Fang, Yang Chen, et al. (18 additional authors not shown)

    Abstract: Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient…

    Submitted 13 April, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

  41. arXiv:2204.02143  [pdf, other]

    cs.SD eess.AS

    RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

    Authors: Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang

    Abstract: Target sound detection (TSD) aims to detect the target sound from a mixture audio given the reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound from the mixture audio. However, the network performs much differently when using different reference audios (e.g. performs poorl…

    Submitted 5 April, 2022; originally announced April 2022.

    Comments: submitted to interspeech2022

  42. arXiv:2204.02088  [pdf, other]

    cs.SD eess.AS

    A Mixed supervised Learning Framework for Target Sound Detection

    Authors: Dongchao Yang, Helin Wang, Yuexian Zou, Wenwu Wang

    Abstract: Target sound detection (TSD) aims to detect the target sound from mixture audio given the reference information. Previous works have shown that TSD models can be trained on fully-annotated (frame-level label) or weakly-annotated (clip-level label) data. However, there is clear evidence that the performance of the model trained on weakly-annotated data is worse than that trained on full…

    Submitted 19 July, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: submitted to DCASE workshop

  43. arXiv:2204.01731  [pdf, ps, other]

    cs.LG eess.SP

    Gan-Based Joint Activity Detection and Channel Estimation For Grant-free Random Access

    Authors: Shuang Liang, Yinan Zou, Yong Zhou

    Abstract: Joint activity detection and channel estimation (JADCE) for grant-free random access is a critical issue that needs to be addressed to support massive connectivity in IoT networks. However, the existing model-free learning method can only achieve either activity detection or channel estimation, but not both. In this paper, we propose a novel model-free learning method based on generative adversari…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: 5 pages, 5 figures IEEE ICASSP2022

  44. arXiv:2204.01716  [pdf, other]

    eess.IV cs.CV

    Estimating Fine-Grained Noise Model via Contrastive Learning

    Authors: Yunhao Zou, Ying Fu

    Abstract: Image denoising has achieved unprecedented progress as great efforts have been made to exploit effective deep denoisers. To improve denoising performance in the real world, two typical solutions are used in recent trends: devising better noise models for the synthesis of more realistic training data, and estimating the noise level function to guide non-blind denoisers. In this work, we combine both noi…

    Submitted 2 April, 2022; originally announced April 2022.

  45. arXiv:2204.01355  [pdf, other]

    eess.AS cs.SD

    Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

    Authors: Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

    Abstract: Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance information may confuse the separation network…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: 5 pages, 1 table, 5 figures. Submitted to INTERSPEECH 2022

  46. arXiv:2204.00821  [pdf, other]

    cs.SD eess.AS

    Improving Target Sound Extraction with Timestamp Information

    Authors: Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

    Abstract: Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. Previous works mainly focus on the problems of weakly-labelled data, joint learning, and new classes; however, little attention has been paid to the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this pap…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: submitted to interspeech2022

  47. arXiv:2203.16772  [pdf, other]

    cs.SD cs.AI eess.AS

    Learning Decoupling Features Through Orthogonality Regularization

    Authors: Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou

    Abstract: Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that state-of-the-art KWS and SV models are trained independently using different datasets since they are expected to learn distinctive acoustic features. However, humans can distinguish language content and speaker identity simultaneously. Motivated by this, we believe it is important…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at ICASSP 2022

  48. arXiv:2203.15614  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Lattice-Free MMI into End-to-End Speech Recognition

    Authors: Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

    Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.…

    Submitted 22 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022

  49. arXiv:2203.14753  [pdf, other]

    eess.SP

    Knowledge-Guided Learning for Transceiver Design in Over-the-Air Federated Learning

    Authors: Yinan Zou, Zixin Wang, Xu Chen, Haibo Zhou, Yong Zhou

    Abstract: In this paper, we consider communication-efficient over-the-air federated learning (FL), where multiple edge devices with non-independent and identically distributed datasets perform multiple local iterations in each communication round and then concurrently transmit their updated gradients to an edge server over the same radio channel for global model aggregation using over-the-air computation (A…

    Submitted 28 March, 2022; originally announced March 2022.

  50. arXiv:2201.07881  [pdf]

    cs.RO eess.SY

    Analysis of lane-change conflict between cars and trucks at merging section using UAV video data

    Authors: Yichen Lu, Kai Cheng, Yue Zhang, Xinqiang Chen, Yajie Zou

    Abstract: The freeway on-ramp merging section is often identified as a crash-prone spot due to the high frequency of traffic conflicts. Very few traffic conflict analysis studies comprehensively consider different vehicle types at freeway merging sections. Thus, the main objective of this study is to analyse conflicts between different vehicle types at freeway merging sections. Field data are collected by Unm…

    Submitted 5 January, 2022; originally announced January 2022.