-
MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results
Authors:
Yaqi Wu,
Zhihao Fan,
Xiaofeng Chu,
Jimmy S. Ren,
Xiaoming Li,
Zongsheng Yue,
Chongyi Li,
Shangcheng Zhou,
Ruicheng Feng,
Yuekun Dai,
Peiqing Yang,
Chen Change Loy,
Senyan Xu,
Zhijing Sun,
Jiaying Zhu,
Yurui Zhu,
Xueyang Fu,
Zheng-Jun Zha,
Jun Cao,
Cheng Li,
Shu Chen,
Liang Ma,
Shiyang Zhou,
Haijin Zeng,
Kai Feng
, et al. (24 additional authors not shown)
Abstract:
The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.
Submitted 8 May, 2024;
originally announced May 2024.
-
Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey
Authors:
Marcos V. Conde,
Zhijun Lei,
Wen Li,
Cosmin Stejerean,
Ioannis Katsavounidis,
Radu Timofte,
Kihwan Yoon,
Ganzorig Gankhuyag,
Jiangtao Lv,
Long Sun,
Jinshan Pan,
Jiangxin Dong,
Jinhui Tang,
Zhiyuan Li,
Hao Wei,
Chenyang Ge,
Dongyang Zhang,
Tianle Liu,
Huaian Chen,
Yi Jin,
Menghan Zhou,
Yiqiang Yan,
Si Gao,
Biao Wu,
Shaoli Liu
, et al. (50 additional authors not shown)
Abstract:
This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.
Submitted 25 April, 2024;
originally announced April 2024.
-
The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report
Authors:
Bin Ren,
Yawei Li,
Nancy Mehta,
Radu Timofte,
Hongyuan Yu,
Cheng Wan,
Yuxin Hong,
Bingnan Han,
Zhuoyuan Wu,
Yajun Zou,
Yuqing Liu,
Jizhe Li,
Keji He,
Chao Fan,
Heng Zhang,
Xiaolin Zhang,
Xuanwu Yin,
Kunlong Zuo,
Bohao Liao,
Peizhe Xia,
Long Peng,
Zhibo Du,
Xin Di,
Wangkai Li,
Yang Wang
, et al. (109 additional authors not shown)
Abstract:
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
Submitted 25 June, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
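As a rough illustration of the PSNR constraint mentioned in the abstract above, PSNR can be computed from the mean squared error between a reference image and its reconstruction. The sketch below is a minimal pure-Python version operating on flattened pixel lists. The function name `psnr` and the default `peak` of 255 are assumptions made for illustration; the challenge's actual evaluation protocol (color space, border cropping, and so on) is not described in this listing.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `peak` is the maximum possible pixel value (255 for 8-bit data is an
    assumption here, not something stated in the abstract).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

A submission would then be compared against a threshold such as 26.90 dB by averaging this value over the validation images, although the exact averaging used by the organizers is likewise an assumption.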
-
Wideband Beamforming for RIS Assisted Near-Field Communications
Authors:
Ji Wang,
Jian Xiao,
Yixuan Zou,
Wenwu Xie,
Yuanwei Liu
Abstract:
A near-field wideband beamforming scheme is investigated for reconfigurable intelligent surface (RIS) assisted multiple-input multiple-output (MIMO) systems, in which a deep learning-based end-to-end (E2E) optimization framework is proposed to maximize the system spectral efficiency. To deal with the near-field double beam split effect, the base station is equipped with frequency-dependent hybrid precoding architecture by introducing sub-connected true time delay (TTD) units, while two specific RIS architectures, namely true time delay-based RIS (TTD-RIS) and virtual subarray-based RIS (SA-RIS), are exploited to realize the frequency-dependent passive beamforming at the RIS. Furthermore, the efficient E2E beamforming models without explicit channel state information are proposed, which jointly exploits the uplink channel training module and the downlink wideband beamforming module. In the proposed network architecture of the E2E models, the classical communication signal processing methods, i.e., polarized filtering and sparsity transform, are leveraged to develop a signal-guided beamforming network. Numerical results show that the proposed E2E models have superior beamforming performance and robustness to conventional beamforming benchmarks. Furthermore, the tradeoff between the beamforming gain and the hardware complexity is investigated for different frequency-dependent RIS architectures, in which the TTD-RIS can achieve better spectral efficiency than the SA-RIS while requiring additional energy consumption and hardware cost.
Submitted 20 January, 2024;
originally announced January 2024.
-
U2-KWS: Unified Two-pass Open-vocabulary Keyword Spotting with Keyword Bias
Authors:
Ao Zhang,
Pan Zhou,
Kaixun Huang,
Yong Zou,
Ming Liu,
Lei Xie
Abstract:
Open-vocabulary keyword spotting (KWS), which allows users to customize keywords, has attracted increasingly more interest. However, existing methods based on acoustic models and post-processing train the acoustic model with ASR training criteria to model all phonemes, making the acoustic model under-optimized for the KWS task. To solve this problem, we propose a novel unified two-pass open-vocabulary KWS (U2-KWS) framework inspired by the two-pass ASR model U2. Specifically, we employ the CTC branch as the first stage model to detect potential keyword candidates and the decoder branch as the second stage model to validate candidates. In order to enhance any customized keywords, we redesign the U2 training procedure for U2-KWS and add keyword information by audio and text cross-attention into both branches. We perform experiments on our internal dataset and Aishell-1. The results show that U2-KWS can achieve a significant relative wake-up rate improvement of 41% compared to the traditional customized KWS systems when the false alarm rate is fixed to 0.5 times per hour.
Submitted 15 December, 2023;
originally announced December 2023.
-
Swift Parameter-free Attention Network for Efficient Super-Resolution
Authors:
Cheng Wan,
Hongyuan Yu,
Zhiqi Li,
Yihang Chen,
Yajun Zou,
Yuqing Liu,
Xuanwu Yin,
Kunlong Zuo
Abstract:
Single Image Super-Resolution (SISR) is a crucial task in low-level computer vision, aiming to reconstruct high-resolution images from low-resolution counterparts. Conventional attention mechanisms have significantly improved SISR performance but often result in complex network structures and large number of parameters, leading to slow inference speed and large model size. To address this issue, we propose the Swift Parameter-free Attention Network (SPAN), a highly efficient SISR model that balances parameter count, inference speed, and image quality. SPAN employs a novel parameter-free attention mechanism, which leverages symmetric activation functions and residual connections to enhance high-contribution information and suppress redundant information. Our theoretical analysis demonstrates the effectiveness of this design in achieving the attention mechanism's purpose. We evaluate SPAN on multiple benchmarks, showing that it outperforms existing efficient super-resolution models in terms of both image quality and inference speed, achieving a significant quality-speed trade-off. This makes SPAN highly suitable for real-world applications, particularly in resource-constrained scenarios. Notably, we won the first place both in the overall performance track and runtime track of the NTIRE 2024 efficient super-resolution challenge. Our code and models are made publicly available at https://github.com/hongyuanyu/SPAN.
Submitted 12 May, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Interference Management by Harnessing Multi-Domain Resources in Spectrum-Sharing Aided Satellite-Ground Integrated Networks
Authors:
Xiaojin Ding,
Yue Lei,
Yulong Zou,
Gengxin Zhang,
Lajos Hanzo
Abstract:
A spectrum-sharing satellite-ground integrated network is conceived, consisting of a pair of non-geostationary orbit (NGSO) constellations and multiple terrestrial base stations, which impose the co-frequency interference (CFI) on each other. The CFI may increase upon increasing the number of satellites. To manage the potentially severe interference, we propose to rely on joint multi-domain resource aided interference management (JMDR-IM). Specifically, the coverage overlap of the constellations considered is analyzed. Then, multi-domain resources - including both the beam-domain and power-domain - are jointly utilized for managing the CFI in an overlapping coverage region. This joint resource utilization is performed by relying on our specifically designed beam-shut-off and switching based beam scheduling, as well as on long short-term memory based joint autoregressive moving average assisted deep Q network aided power scheduling. Moreover, the outage probability (OP) of the proposed JMDR-IM scheme is derived, and the asymptotic analysis of the OP is also provided. Our performance evaluations demonstrate the superiority of the proposed JMDR-IM scheme in terms of its increased throughput and reduced OP.
Submitted 29 January, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Multi-Agent Robust Control Synthesis from Global Temporal Logic Tasks
Authors:
Tiange Yang,
Yuanyuan Zou,
Jinfeng Liu,
Tianyu Jia,
Shaoyuan Li
Abstract:
This paper focuses on the heterogeneous multi-agent control problem under global temporal logic tasks. We define a specification language, called extended capacity temporal logic (ECaTL), to describe the required global tasks, including the number of times that a local or coupled signal temporal logic (STL) task needs to be satisfied and the synchronous requirements on task satisfaction. The robustness measure for ECaTL is formally designed. In particular, the robustness for synchronous tasks is evaluated from both the temporal and spatial perspectives. Mixed-integer linear constraints are designed to encode ECaTL specifications, and a two-step optimization framework is further proposed to realize task-satisfied motion planning with high spatial robustness and synchronicity. Simulations are conducted to demonstrate the expressivity of ECaTL and the efficiency of the proposed control synthesis approach.
Submitted 17 November, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image
Authors:
Yunhao Zou,
Chenggang Yan,
Ying Fu
Abstract:
High dynamic range (HDR) images capture much more intensity levels than standard ones. Current methods predominantly generate HDR images from 8-bit low dynamic range (LDR) sRGB images that have been degraded by the camera processing pipeline. However, it becomes a formidable task to retrieve extremely high dynamic range scenes from such limited bit-depth data. Unlike existing methods, the core idea of this work is to incorporate more informative Raw sensor data to generate HDR images, aiming to recover scene information in hard regions (the darkest and brightest areas of an HDR scene). To this end, we propose a model tailor-made for Raw images, harnessing the unique features of Raw data to facilitate the Raw-to-HDR mapping. Specifically, we learn exposure masks to separate the hard and easy regions of a high dynamic scene. Then, we introduce two important guidances, dual intensity guidance, which guides less informative channels with more informative ones, and global spatial guidance, which extrapolates scene specifics over an extended spatial domain. To verify our Raw-to-HDR approach, we collect a large Raw/HDR paired dataset for both training and testing. Our empirical evaluations validate the superiority of the proposed Raw-to-HDR reconstruction model, as well as our newly captured dataset in the experiments.
Submitted 5 September, 2023;
originally announced September 2023.
-
NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement
Authors:
Wen Wang,
Dongchao Yang,
Qichen Ye,
Bowen Cao,
Yuexian Zou
Abstract:
The goal of speech enhancement (SE) is to eliminate the background interference from the noisy speech signal. Generative models such as diffusion models (DM) have been applied to the task of SE because of better generalization in unseen noisy scenes. Technical routes for the DM-based SE methods can be summarized into three types: task-adapted diffusion process formulation, generator-plus-conditioner (GPC) structures and the multi-stage frameworks. We focus on the first two approaches, which are constructed under the GPC architecture and use the task-adapted diffusion process to better deal with the real noise. However, the performance of these SE models is limited by the following issues: (a) Non-Gaussian noise estimation in the task-adapted diffusion process. (b) Conditional domain bias caused by the weak conditioner design in the GPC structure. (c) Large amount of residual noise caused by unreasonable interpolation operations during inference. To solve the above problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to boost the SE performance, where the noise representation is extracted from the noisy speech signal and introduced as a global conditional information for estimating the non-Gaussian components. Furthermore, the anchor-based inference algorithm is employed to achieve a compromise between the speech distortion and noise residual. In order to mitigate the performance degradation caused by the conditional domain bias in the GPC framework, we investigate three model variants, all of which can be viewed as multi-stage SE based on the preprocessing networks for Mel spectrograms. Experimental results show that NADiffuSE outperforms other DM-based SE models under the GPC infrastructure. Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/.
Submitted 3 September, 2023;
originally announced September 2023.
-
UD-MAC: Delay Tolerant Multiple Access Control Protocol for Unmanned Aerial Vehicle Networks
Authors:
Yingying Zou,
Zhiqing Wei,
Yanpeng Cui,
Xinyi Liu,
Zhiyong Feng
Abstract:
In unmanned aerial vehicle (UAV) networks, high-capacity data transmission is of utmost importance for applications such as intelligent transportation, smart cities, and forest monitoring, which rely on the mobility of UAVs to collect and transmit large amount of data, including video and image data. Due to the short flight time of UAVs, the network capacity will be reduced when they return to the ground unit for charging. Hence, we suggest that UAVs can apply a store-carry-and-forward (SCF) transmission mode to carry packets on their way back to the ground unit for improving network throughput. In this paper, we propose a novel protocol, named UAV delay-tolerant multiple access control (UD-MAC), which can support different transmission modes in UAV networks. We set a higher priority for SCF transmission and analyze the probability of being in SCF mode to derive network throughput. The simulation results show that the network throughput of UD-MAC is improved by 57% to 83% compared to VeMAC.
Submitted 13 August, 2023;
originally announced August 2023.
-
Detecting the Anomalies in LiDAR Pointcloud
Authors:
Chiyu Zhang,
Ji Han,
Yao Zou,
Kexin Dong,
Yujia Li,
Junchun Ding,
Xiaoling Han
Abstract:
LiDAR sensors play an important role in the perception stack of modern autonomous driving systems. Adverse weather conditions such as rain, fog and dust, as well as some (occasional) LiDAR hardware fault may cause the LiDAR to produce pointcloud with abnormal patterns such as scattered noise points and uncommon intensity values. In this paper, we propose a novel approach to detect whether a LiDAR is generating anomalous pointcloud by analyzing the pointcloud characteristics. Specifically, we develop a pointcloud quality metric based on the LiDAR points' spatial and intensity distribution to characterize the noise level of the pointcloud, which relies on pure mathematical analysis and does not require any labeling or training as learning-based methods do. Therefore, the method is scalable and can be quickly deployed either online to improve the autonomy safety by monitoring anomalies in the LiDAR data or offline to perform in-depth study of the LiDAR behavior over large amount of data. The proposed approach is studied with extensive real public road data collected by LiDARs with different scanning mechanisms and laser spectrums, and is proven to be able to effectively handle various known and unknown sources of pointcloud anomaly.
Submitted 31 July, 2023;
originally announced August 2023.
-
Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions
Authors:
Yifei Xin,
Yuexian Zou
Abstract:
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. Besides, we also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that our HCI significantly improves the ATR performance. Moreover, our AC framework also shows stable performance gains on multiple datasets.
Submitted 28 July, 2023;
originally announced July 2023.
-
SleepEGAN: A GAN-enhanced Ensemble Deep Learning Model for Imbalanced Classification of Sleep Stages
Authors:
Xuewei Cheng,
Ke Huang,
Yi Zou,
Shujie Ma
Abstract:
Deep neural networks have played an important role in automatic sleep stage classification because of their strong representation and in-model feature transformation abilities. However, class imbalance and individual heterogeneity which typically exist in raw EEG signals of sleep data can significantly affect the classification performance of any machine learning algorithms. To solve these two problems, this paper develops a generative adversarial network (GAN)-powered ensemble deep learning model, named SleepEGAN, for the imbalanced classification of sleep stages. To alleviate class imbalance, we propose a new GAN (called EGAN) architecture adapted to the features of EEG signals for data augmentation. The generated samples for the minority classes are used in the training process. In addition, we design a cost-free ensemble learning strategy to reduce the model estimation variance caused by the heterogeneity between the validation and test sets, so as to enhance the accuracy and robustness of prediction performance. We show that the proposed method can improve classification accuracy compared to several existing state-of-the-art methods using three public sleep datasets.
Submitted 3 July, 2023;
originally announced July 2023.
-
Perch a quadrotor on planes by the ceiling effect
Authors:
Yuying Zou,
Haotian Li,
Yunfan Ren,
Wei Xu,
Yihang Li,
Yixi Cai,
Shenji Zhou,
Fu Zhang
Abstract:
Perching is a promising solution for a small unmanned aerial vehicle (UAV) to save energy and extend operation time. This paper proposes a quadrotor that can perch on planar structures using the ceiling effect. Compared with the existing work, this perching method does not require any claws, hooks, or adhesive pads, leading to a simpler system design. This method does not limit the perching by surface angle or material either. The design of the quadrotor that only uses its propeller guards for surface contact is presented in this paper. We also discussed the automatic perching strategy including trajectory generation and power management. Experiments are conducted to verify that the approach is practical and the UAV can perch on planes with different angles. Energy consumption in the perching state is assessed, showing that more than 30% of power can be saved. Meanwhile, the quadrotor exhibits improved stability while perching compared to when it is hovering.
Submitted 3 July, 2023;
originally announced July 2023.
-
Channel prior convolutional attention for medical image segmentation
Authors:
Hejun Huang,
Zuguo Chen,
Ying Zou,
Ming Lu,
Chaoyang Chen
Abstract:
Characteristics such as low contrast and significant organ shape variations are often exhibited in medical images. The improvement of segmentation performance in medical imaging is limited by the generally insufficient adaptive capabilities of existing attention mechanisms. An efficient Channel Prior Convolutional Attention (CPCA) method is proposed in this paper, supporting the dynamic distribution of attention weights in both channel and spatial dimensions. Spatial relationships are effectively extracted while preserving the channel prior by employing a multi-scale depth-wise convolutional module. The ability to focus on informative channels and important regions is possessed by CPCA. A segmentation network called CPCANet for medical image segmentation is proposed based on CPCA. CPCANet is validated on two publicly available datasets. Improved segmentation performance is achieved by CPCANet while requiring fewer computational resources through comparisons with state-of-the-art algorithms. Our code is publicly available at https://github.com/Cuthbert-Huang/CPCANet.
Submitted 8 June, 2023;
originally announced June 2023.
-
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec
Authors:
Dongchao Yang,
Songxiang Liu,
Rongjie Huang,
Jinchuan Tian,
Chao Weng,
Yuexian Zou
Abstract:
Audio codec models are widely used in audio communication as a crucial technique for compressing audio into discrete representations. Nowadays, audio codec models are increasingly utilized in generation fields as intermediate representations. For instance, AudioLM is an audio generation model that uses the discrete representation of SoundStream as a training target, while VALL-E employs the Encodec model as an intermediate feature to aid TTS tasks. Despite their usefulness, two challenges persist: (1) training these audio codec models can be difficult due to the lack of publicly available training processes and the need for large-scale data and GPUs; (2) achieving good reconstruction performance requires many codebooks, which increases the burden on generation models. In this study, we propose a group-residual vector quantization (GRVQ) technique and use it to develop a novel High Fidelity Audio Codec model, HiFi-Codec, which only requires 4 codebooks. We train all the models using publicly available TTS data such as LibriTTS, VCTK, AISHELL, and more, with a total duration of over 1000 hours, using 8 GPUs. Our experimental results show that HiFi-Codec outperforms Encodec in terms of reconstruction performance despite requiring only 4 codebooks. To facilitate research in audio codec and generation, we introduce AcademiCodec, the first open-source audio codec toolkit that offers training codes and pre-trained models for Encodec, SoundStream, and HiFi-Codec. Code and pre-trained model can be found on: https://github.com/yangdongchao/AcademiCodec
Submitted 7 May, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research
Authors:
Xinhao Mei,
Chutong Meng,
Haohe Liu,
Qiuqiang Kong,
Tom Ko,
Chengqi Zhao,
Mark D. Plumbley,
Yuexian Zou,
Wenwu Wang
Abstract:
The advancement of audio-language (AL) multimodal learning tasks has been significant in recent years. However, researchers face challenges due to the costly and time-consuming collection process of existing audio-language datasets, which are limited in size. To address this data scarcity issue, we introduce WavCaps, the first large-scale weakly-labelled audio captioning dataset, comprising approximately 400k audio clips with paired captions. We sourced audio clips and their raw descriptions from web sources and a sound event detection dataset. However, the online-harvested raw descriptions are highly noisy and unsuitable for direct use in tasks such as automated audio captioning. To overcome this issue, we propose a three-stage processing pipeline for filtering noisy data and generating high-quality captions, where ChatGPT, a large language model, is leveraged to filter and transform raw descriptions automatically. We conduct a comprehensive analysis of the characteristics of WavCaps dataset and evaluate it on multiple downstream audio-language multimodal learning tasks. The systems trained on WavCaps outperform previous state-of-the-art (SOTA) models by a significant margin. Our aspiration is for the WavCaps dataset we have proposed to facilitate research in audio-language multimodal learning and demonstrate the potential of utilizing ChatGPT to enhance academic research. Our dataset and codes are available at https://github.com/XinhaoMei/WavCaps.
Submitted 30 March, 2023;
originally announced March 2023.
-
A Dual-Cluster-Head Based Medium Access Control for Large-Scale UAV Ad-Hoc Networks
Authors:
Xinru Zhao,
Zhiqing Wei,
Yingying Zou,
Hao Ma,
Yanpeng Cui,
Zhiyong Feng
Abstract:
Unmanned aerial vehicle (UAV) ad hoc networks have achieved significant growth in recent years owing to their flexibility, extensibility, and high deployability. Applying a clustering scheme to UAV ad hoc networks is imperative for improving throughput and energy efficiency. In conventional clustering schemes, a single cluster head (CH) is always assigned to each cluster. However, this method has weaknesses such as overload and premature death of the CH as the number of UAVs increases. To solve this problem, we propose a dual-cluster-head based medium access control (DCHMAC) scheme for large-scale UAV networks. In DCHMAC, two CHs are elected to manage resource allocation and data forwarding cooperatively. Specifically, the two CHs work on different channels: one CH is used for intra-cluster communication and the other for inter-cluster communication. A Markov chain model is developed to analyse the throughput of the network. Simulation results show that, compared with FM-MAC (flying ad hoc networks multi-channel MAC), DCHMAC improves the throughput by approximately 20%-50% and prolongs the network lifetime by approximately 40%.
Submitted 26 February, 2023;
originally announced March 2023.
-
Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss
Authors:
Yifei Xin,
Dongchao Yang,
Yuexian Zou
Abstract:
In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and audio, the semantic information contained in the text is only similar to certain frames within the audio. Yet, existing works aggregate the entire audio without considering the text, such as mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot product attention for a text to attend to its most semantically similar frames. Furthermore, previous methods only conduct the softmax for every single-side retrieval, ignoring the potential cross-retrieval information. By exploring the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss to filter the hard case with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving the dual optimal match. Experiments show that our TAP significantly outperforms various text-agnostic pooling functions. Moreover, our PMR loss also shows stable performance gains on multiple datasets.
Submitted 30 March, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Improving Weakly Supervised Sound Event Detection with Causal Intervention
Authors:
Yifei Xin,
Dongchao Yang,
Fan Cui,
Yujun Wang,
Yuexian Zou
Abstract:
Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrences simultaneously, i.e., some sound events often co-occur, and their occurrences are usually accompanied by specific background sounds, so they would be inevitably entangled, causing misclassification and biased localization results with only clip-level supervision. To tackle this issue, we first establish a structural causal model (SCM) to reveal that the context is the main cause of co-occurrence confounders that mislead the model to learn spurious correlations between frames and clip-level labels. Based on the causal analysis, we propose a causal intervention (CI) method for WSSED to remove the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts to the frame-level features for making the event boundary clearer. Experiments show that our method effectively improves the performance on multiple datasets and can generalize to various baseline models.
Submitted 9 March, 2023;
originally announced March 2023.
-
LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion
Authors:
Chunfeng Wang,
Peisong Huang,
Yuxiang Zou,
Haoyu Zhang,
Shichao Liu,
Xiang Yin,
Zejun Ma
Abstract:
As a key component of automated speech recognition (ASR) and the front-end in text-to-speech (TTS), grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations. Existing methods are either slow or poor in performance, and are limited in application scenarios, particularly in the process of on-device inference. In this paper, we integrate the advantages of both expert knowledge and a connectionist temporal classification (CTC) based neural network and propose a novel method named LiteG2P, which is fast, light and theoretically parallel. With its careful design, LiteG2P can be applied both on cloud and on device. Experimental results on the CMU dataset show that the performance of the proposed method is superior to the state-of-the-art CTC based method with 10 times fewer parameters, and even comparable to the state-of-the-art Transformer-based sequence-to-sequence model with fewer parameters and 33 times less computation.
Submitted 2 March, 2023;
originally announced March 2023.
-
SSVMR: Saliency-based Self-training for Video-Music Retrieval
Authors:
Xuxin Cheng,
Zhihong Zhu,
Hongxiang Li,
Yaowei Li,
Yuexian Zou
Abstract:
With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and music in the feature space. However, they (1) neglect the inevitable label noise; (2) neglect to enhance the ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, which is termed SSVMR. Specifically, we first seek to make full use of the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of label noise, where a self-training approach is adopted. In addition, we propose to capture the saliency of the video by mixing two videos at span level and preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that our SSVMR achieves state-of-the-art performance by a large margin, obtaining a relative improvement of 34.8% over the previous best model in terms of R@1.
Submitted 18 February, 2023;
originally announced February 2023.
-
Brain Tissue Segmentation Across the Human Lifespan via Supervised Contrastive Learning
Authors:
Xiaoyang Chen,
**jian Wu,
Wenjiao Lyu,
Yicheng Zou,
Kim-Han Thung,
Siyuan Liu,
Ye Wu,
Sahar Ahmad,
Pew-Thian Yap
Abstract:
Automatic segmentation of brain MR images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is critical for tissue volumetric analysis and cortical surface reconstruction. Due to dramatic structural and appearance changes associated with developmental and aging processes, existing brain tissue segmentation methods are only viable for specific age groups. Consequently, methods developed for one age group may fail for another. In this paper, we make the first attempt to segment brain tissues across the entire human lifespan (0-100 years of age) using a unified deep learning model. To overcome the challenges related to structural variability underpinned by biological processes, intensity inhomogeneity, motion artifacts, scanner-induced differences, and acquisition protocols, we propose to use contrastive learning to improve the quality of feature representations in a latent space for effective lifespan tissue segmentation. We compared our approach with commonly used segmentation methods on a large-scale dataset of 2,464 MR images. Experimental results show that our model accurately segments brain tissues across the lifespan and outperforms existing methods.
Submitted 3 January, 2023;
originally announced January 2023.
-
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Authors:
Rongzhi Gu,
Shi-Xiong Zhang,
Yuexian Zou,
Dong Yu
Abstract:
Recently, frequency domain all-neural beamforming methods have achieved remarkable progress for multichannel speech separation. In parallel, the integration of time domain network structure and beamforming has also gained significant attention. This study proposes a novel all-neural beamforming method in time domain and makes an attempt to unify the all-neural beamforming pipelines for time domain and frequency domain multichannel speech separation. The proposed model consists of two modules: separation and beamforming. Both modules perform temporal-spectral-spatial modeling and are trained end-to-end using a joint loss function. The novelty of this study is twofold. Firstly, a time domain directional feature conditioned on the direction of the target speaker is proposed, which can be jointly optimized within the time domain architecture to enhance target signal estimation. Secondly, an all-neural beamforming network in time domain is designed to refine the pre-separated results. This module features parametric time-variant beamforming coefficient estimation, without explicitly following the derivation of optimal filters that may lead to an upper bound. The proposed method is evaluated on simulated reverberant overlapped speech data derived from the AISHELL-1 corpus. Experimental results demonstrate significant performance improvements over frequency domain state-of-the-art methods, ideal magnitude masks and existing time domain neural beamforming methods.
Submitted 23 December, 2022; v1 submitted 16 December, 2022;
originally announced December 2022.
-
M3ST: Mix at Three Levels for Speech Translation
Authors:
Xuxin Cheng,
Qianqian Dong,
Fengpeng Yue,
Tom Ko,
Mingxuan Wang,
Yuexian Zou
Abstract:
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose the Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two stages of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. In the second stage of fine-tuning, we feed both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on the MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
Submitted 7 December, 2022;
originally announced December 2022.
-
Proximal Gradient-Based Unfolding for Massive Random Access in IoT Networks
Authors:
Yinan Zou,
Yong Zhou,
Xu Chen,
Yonina C. Eldar
Abstract:
Grant-free random access is an effective technology for enabling low-overhead and low-latency massive access, where joint activity detection and channel estimation (JADCE) is a critical issue. Although existing compressive sensing algorithms can be applied for JADCE, they usually fail to simultaneously harvest the following properties: effective sparsity inducing, fast convergence, robust to different pilot sequences, and adaptive to time-varying networks. To this end, we propose an unfolding framework for JADCE based on the proximal gradient method. Specifically, we formulate the JADCE problem as a group-row-sparse matrix recovery problem and leverage a minimax concave penalty rather than the widely-used $\ell_1$-norm to induce sparsity. We then develop a proximal gradient-based unfolding neural network that parameterizes the algorithmic iterations. To improve convergence rate, we incorporate momentum into the unfolding neural network, and prove the accelerated convergence theoretically. Based on the convergence analysis, we further develop an adaptive-tuning algorithm, which adjusts its parameters to different signal-to-noise ratio settings. Simulations show that the proposed unfolding neural network achieves better recovery performance, convergence rate, and adaptivity than current baselines.
Submitted 4 December, 2022;
originally announced December 2022.
-
NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS
Authors:
Dongchao Yang,
Songxiang Liu,
Jianwei Yu,
Helin Wang,
Chao Weng,
Yuexian Zou
Abstract:
Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise. (2) The TTS systems should have good generalization for unseen speaking styles. In this paper, we present a noise-robust expressive TTS model (NoreSpeech), which can robustly transfer speaking style in a noisy reference utterance to synthesized speech. Specifically, our NoreSpeech includes several components: (1) a novel DiffStyle module, which leverages powerful probabilistic denoising diffusion models to learn noise-agnostic speaking style features from a teacher model by knowledge distillation; (2) a VQ-VAE block, which maps the style features into a controllable quantized latent space for improving the generalization of style transfer; and (3) a straightforward but effective parameter-free text-style alignment module, which enables NoreSpeech to transfer style to a textual input from a length-mismatched reference utterance. Experiments demonstrate that NoreSpeech is more effective than previous expressive TTS models in noisy environments. Audio samples and code are available at: http://dongchaoyang.top/NoreSpeech_demo/
Submitted 4 November, 2022;
originally announced November 2022.
-
Exploiting NOMA and RIS in Integrated Sensing and Communication
Authors:
Jiakuo Zuo,
Yuanwei Liu,
Chenming Zhu,
Yixuan Zou,
Dengyin Zhang,
Naofal Al-Dhahir
Abstract:
A novel integrated sensing and communication (ISAC) system is proposed, where a dual-functional base station is utilized to transmit the superimposed non-orthogonal multiple access (NOMA) communication signal for serving communication users and sensing targets simultaneously. Furthermore, a new reconfigurable intelligent surface (RIS)-aided-sensing structure is also proposed to address the significant path loss or blockage of LoS links for the sensing task. Based on this setup, the beampattern gain at the RIS for the radar target is derived and adopted as a sensing metric. The objective of this paper is to maximize the minimum beampattern gain by jointly optimizing active beamforming, power allocation coefficients and passive beamforming. To tackle the non-convexity of the formulated optimization problem, the beampattern gain and constraints are first transformed into more tractable forms. Then, an iterative block coordinate descent (IBCD) algorithm is proposed by employing successive convex approximation (SCA), Schur complement, semidefinite relaxation (SDR) and sequential rank-one constraint relaxation (SRCR) methods. To reduce the complexity of the proposed IBCD algorithm, a low-complexity iterative alternating optimization (IAO) algorithm is proposed. In particular, the active beamforming is optimized by solving a semidefinite programming (SDP) problem and the closed-form solutions of the power allocation coefficients are derived. Numerical results show that: i) the proposed RIS-NOMA-ISAC system always outperforms the RIS-ISAC system without NOMA in beampattern gain and illumination power; ii) the low-complexity IAO algorithm achieves a comparable performance to that achieved by the IBCD algorithm; iii) high beampattern gain can be achieved by the proposed joint optimization algorithms in underloaded and overloaded communication scenarios.
Submitted 6 October, 2022;
originally announced October 2022.
-
Complementary consistency semi-supervised learning for 3D left atrial image segmentation
Authors:
Hejun Huang,
Zuguo Chen,
Chaoyang Chen,
Ming Lu,
Ying Zou
Abstract:
A network based on complementary consistency training, called CC-Net, has been proposed for semi-supervised left atrium image segmentation. CC-Net efficiently utilizes unlabeled data from the perspective of complementary information to address the problem of limited ability of existing semi-supervised segmentation algorithms to extract information from unlabeled data. The complementary symmetric structure of CC-Net includes a main model and two auxiliary models. The complementary model inter-perturbations between the main and auxiliary models force consistency to form complementary consistency. The complementary information obtained by the two auxiliary models helps the main model to effectively focus on ambiguous areas, while enforcing consistency between the models is advantageous in obtaining decision boundaries with low uncertainty. CC-Net has been validated on two public datasets. In the case of specific proportions of labeled data, compared with current advanced algorithms, CC-Net has the best semi-supervised segmentation performance. Our code is publicly available at https://github.com/Cuthbert-Huang/CC-Net.
Submitted 4 April, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Anti-Delay Kalman Filter Fusion Algorithm for Vehicle-borne Sensor Network with Finite-Time Convergence
Authors:
Hang Yu,
Keren Dai,
Haojie Li,
Yao Zou,
Xiang Ma,
Shaojie Ma,
He Zhang
Abstract:
Autonomous driving and obstacle avoidance in intelligent vehicles place higher demands on the precise relative state of vehicles. For a vehicle-borne sensor network with time-varying transmission delays, the problem of coordinate fusion of vehicle state is the focus of this paper. Through an ingeniously designed low-complexity integration of a consensus strategy and buffer technology, an anti-delay distributed Kalman filter (DKF) with finite-time convergence is proposed. By introducing the matrix weight to assess local estimates, the optimal fusion state result is available in the sense of linear minimum variance. In addition, to accommodate practical engineering in intelligent vehicles, the communication weight coefficient and directed topology with unidirectional transmission are also considered. From a theoretical perspective, proofs of the upper bounds of the error covariances under different communication topologies with delays are presented. Furthermore, the maximum allowable delay of the vehicle-borne sensor network is derived backwards. Simulations verify that, while considering the various non-ideal factors above, the proposed DKF algorithm produces more accurate and robust fusion estimation state results than existing algorithms, making it more valuable in practical applications. In addition, a mobile car trajectory tracking experiment is carried out, which further verifies the feasibility of the proposed algorithm.
Submitted 19 September, 2022;
originally announced September 2022.
-
Modeling mandatory and discretionary lane changes using dynamic interaction networks
Authors:
Yue Zhang,
Yajie Zou,
Yuanchang Xie,
Lei Chen
Abstract:
A quantitative understanding of dynamic lane-changing (LC) interaction patterns is indispensable for improving the decision-making of autonomous vehicles, especially in mixed traffic with human-driven vehicles. This paper develops a novel framework combining the hidden Markov model and graph structure to identify the difference in dynamic interaction networks between mandatory lane changes (MLC) and discretionary lane changes (DLC). A hidden Markov model is developed to decompose LC interactions into homogenous segments and reveal the temporal properties of these segments. Then, conditional mutual information is used to quantify the interaction intensity, and the graph structure is used to characterize the connectivity between vehicles. Finally, the critical vehicle in each dynamic interaction network is identified. Based on the LC events extracted from the INTERACTION dataset, the proposed analytical framework is applied to modeling MLC and DLC under congested traffic with levels of service E and F. The results show that there are multiple heterogeneous dynamic interaction network structures in an LC process. A comparison of MLC and DLC demonstrates that MLC are more complex, while DLC are more random. The complexity of MLC is attributed to the intense interaction and frequent transition of the interaction network structure, while the random DLC demonstrate no obvious evolution rules and dominant vehicles in interaction networks. The findings in this study are useful for understanding the connectivity structure between vehicles in LC interactions, and for designing appropriate and well-directed driving decision-making models for autonomous vehicles and advanced driver-assistance systems.
Submitted 26 July, 2022;
originally announced July 2022.
-
Diffsound: Discrete Diffusion Model for Text-to-sound Generation
Authors:
Dongchao Yang,
Jianwei Yu,
Helin Wang,
Wen Wang,
Chao Weng,
Yuexian Zou,
Dong Yu
Abstract:
Generating sound effects that humans want is an important topic. However, there are few studies in this area for sound generation. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder. The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform. We found that the decoder significantly influences the generation performance. Thus, we focus on designing a good decoder in this study. We begin with the traditional autoregressive (AR) decoder, which has been shown to be a state-of-the-art method in previous sound generation works. However, the AR decoder always predicts the mel-spectrogram tokens one by one in order, which introduces the problems of unidirectional bias and error accumulation. Moreover, with the AR decoder, the sound generation time increases linearly with the sound duration. To overcome the shortcomings introduced by AR decoders, we propose a non-autoregressive decoder based on the discrete diffusion model, named Diffsound. Specifically, Diffsound predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best-predicted results can be obtained after several steps. Our experiments show that our proposed Diffsound not only produces better text-to-sound generation results when compared with the AR decoder but also has a faster generation speed, e.g., MOS: 3.56 vs. 2.786, and the generation speed is five times faster than the AR decoder.
Submitted 28 April, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound
Authors:
Yuxin Zou,
Haoran Dou,
Yuhao Huang,
Xin Yang,
Jikuan Qian,
Chaojiong Zhen,
Xiaodan Ji,
Nishant Ravikumar,
Guoqiang Chen,
Weijun Huang,
Alejandro F. Frangi,
Dong Ni
Abstract:
Standard plane (SP) localization is essential in routine clinical ultrasound (US) diagnosis. Compared to 2D US, 3D US can acquire multiple view planes in one scan and provide complete anatomy with the addition of the coronal plane. However, manually navigating SPs in 3D US is laborious and biased due to the orientation variability and huge search space. In this study, we introduce a novel reinforcement learning (RL) framework for automatic SP localization in 3D US. Our contribution is three-fold. First, we formulate SP localization in 3D US as a tangent-point-based problem in RL to restructure the action space and significantly reduce the search space. Second, we design an auxiliary task learning strategy to enhance the model's ability to recognize subtle differences between non-SPs and SPs in plane search. Finally, we propose a spatial-anatomical reward to effectively guide learning trajectories by exploiting spatial and anatomical information simultaneously. We explore the efficacy of our approach in localizing four SPs on uterus and fetal brain datasets. The experiments indicate that our approach achieves high localization accuracy as well as robust performance.
Submitted 1 July, 2022;
originally announced July 2022.
-
Real-time Dual-channel 2 * 2 MIMO Fiber-THz-Fiber Seamless Integration System at 385 GHz and 435 GHz
Authors:
Jiao Zhang,
Min Zhu,
Bingchang Hua,
Mingzheng Lei,
Yuancheng Cai,
Liang Tian,
Yucong Zou,
Like Ma,
Yongming Huang,
Jianjun Yu,
Xiaohu You
Abstract:
We demonstrate the first practical real-time dual-channel fiber-THz-fiber 2 * 2 MIMO seamless integration system with a record net data rate of 2 * 103.125 Gb/s at 385 GHz and 435 GHz over two spans of 20 km SSMF and 3 m wireless link.
Submitted 24 June, 2022;
originally announced June 2022.
-
Electrocardiographic Deep Learning for Predicting Post-Procedural Mortality
Authors:
David Ouyang,
John Theurer,
Nathan R. Stein,
J. Weston Hughes,
Pierre Elias,
Bryan He,
Neal Yuan,
Grant Duffy,
Roopinder K. Sandhu,
Joseph Ebinger,
Patrick Botting,
Melvin Jujjavarapu,
Brian Claggett,
James E. Tooley,
Tim Poterucha,
Jonathan H. Chen,
Michael Nurok,
Marco Perez,
Adler Perotte,
James Y. Zou,
Nancy R. Cook,
Sumeet S. Chugh,
Susan Cheng,
Christine M. Albert
Abstract:
Background. Pre-operative risk assessments used in clinical practice are limited in their ability to identify risk for post-operative mortality. We hypothesize that electrocardiograms contain hidden risk markers that can help prognosticate post-operative mortality. Methods. In a derivation cohort of 45,969 pre-operative patients (age 59 ± 19 years, 55 percent women), a deep learning algorithm was developed to leverage waveform signals from pre-operative ECGs to discriminate post-operative mortality. Model performance was assessed in a holdout internal test dataset and in two external hospital cohorts and compared with the Revised Cardiac Risk Index (RCRI) score. Results. In the derivation cohort, there were 1,452 deaths. The algorithm discriminates mortality with an AUC of 0.83 (95% CI 0.79-0.87), surpassing the discrimination of the RCRI score with an AUC of 0.67 (CI 0.61-0.72) in the held-out test cohort. Patients determined to be high risk by the deep learning model's risk prediction had an unadjusted odds ratio (OR) of 8.83 (5.57-13.20) for post-operative mortality as compared to an unadjusted OR of 2.08 (CI 0.77-3.50) for post-operative mortality for RCRI greater than 2. The deep learning algorithm performed similarly for patients undergoing cardiac surgery with an AUC of 0.85 (CI 0.77-0.92), non-cardiac surgery with an AUC of 0.83 (0.79-0.88), and catheterization or endoscopy suite procedures with an AUC of 0.76 (0.72-0.81). The algorithm similarly discriminated risk for mortality in two separate external validation cohorts from independent healthcare systems with AUCs of 0.79 (0.75-0.83) and 0.75 (0.74-0.76), respectively. Conclusion. The findings demonstrate how a novel deep learning algorithm, applied to pre-operative ECGs, can improve discrimination of post-operative mortality.
Submitted 30 April, 2022;
originally announced May 2022.
-
Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention
Authors:
Xinmeng Xu,
Rongzhi Gu,
Yuexian Zou
Abstract:
Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach specifically formulates the decoder with an extra SNR estimator to estimate frame-level SNR under a multi-task learning framework, which is expected to avoid the speech distortion caused by the end-to-end DMSE module. Finally, a spectral gain function is adopted to further suppress the unnatural residual noise. Experimental results demonstrate superior performance of the proposed model against several state-of-the-art models.
Submitted 2 May, 2022;
originally announced May 2022.
-
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Authors:
Chenyu You,
Nuo Chen,
Fenglin Liu,
Shen Ge,
Xian Wu,
Yuexian Zou
Abstract:
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that humans seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows given the speech documents. In this task, our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities to systems in information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of the existing state-of-the-art methods significantly degrades on our dataset, hence demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering tasks.
Submitted 29 April, 2022;
originally announced April 2022.
-
Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction
Authors:
Zifeng Zhao,
Rongzhi Gu,
Dongchao Yang,
**chuan Tian,
Yuexian Zou
Abstract:
Most existing research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing up different speaker-aware mixtures (SAMs), each containing multiple speakers with their identities known and enrollment utterances available. Informed by enrollment utterances, target speech is extracted from the input one by one, such that the estimated targets can approximate the original SAMs after a remix in accordance with the identity consistency. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean sources enables application in noisy scenarios. Extensive experiments on Libri2Mix show that the proposed method achieves promising results without access to any clean sources (11.06 dB SI-SDRi). With domain adaptation, our approach even outperforms the supervised framework in a cross-domain evaluation on AISHELL-1.
Submitted 15 April, 2022;
originally announced April 2022.
-
WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung Adenocarcinoma
Authors:
Chu Han,
Xipeng Pan,
Lixu Yan,
Huan Lin,
Bingbing Li,
Su Yao,
Shanshan Lv,
Zhenwei Shi,
**hai Mai,
Jiatai Lin,
Bingchao Zhao,
Zeyan Xu,
Zhizhen Wang,
Yumeng Wang,
Yuan Zhang,
Huihui Wang,
Chao Zhu,
Chunhui Lin,
Lijian Mao,
Min Wu,
Luwen Duan,
**gsong Zhu,
Dong Hu,
Zijie Fang,
Yang Chen
, et al. (18 additional authors not shown)
Abstract:
Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient pixel-level annotations, which are time-consuming and expensive to obtain. To enrich the label resources of LUAD and to alleviate the annotation efforts, we organize the WSSS4LUAD challenge to call for outstanding weakly-supervised semantic segmentation (WSSS) techniques for histopathology images of LUAD. Participants have to design an algorithm to segment tumor epithelial, tumor-associated stroma and normal tissue with only patch-level labels. This challenge includes 10,091 patch-level annotations (the training set) and over 130 million labeled pixels (the validation and test sets), from 87 WSIs (67 from GDPH, 20 from TCGA). All the labels were generated by a pathologist-in-the-loop pipeline with the help of AI models and checked by the label review board. Among 532 registrations, 28 teams submitted results in the test phase with over 1,000 submissions. Finally, the first-place team achieved an mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919). According to the technical reports of the top-tier teams, CAM is still the most popular approach in WSSS. Cutmix data augmentation has been widely adopted to generate more reliable samples. With the success of this challenge, we believe that WSSS approaches with patch-level annotations can be a complement to the traditional pixel annotations while reducing the annotation efforts. The entire dataset has been released to encourage more research on computational pathology in LUAD and more novel WSSS techniques.
Submitted 13 April, 2022; v1 submitted 13 April, 2022;
originally announced April 2022.
-
RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection
Authors:
Dongchao Yang,
Helin Wang,
Zhongjie Ye,
Yuexian Zou,
Wenwu Wang
Abstract:
Target sound detection (TSD) aims to detect the target sound from a mixture audio given the reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound from the mixture audio. However, the network performs much differently when using different reference audios (e.g. performs poorly for noisy and short-duration reference audios), and tends to make wrong decisions for transient events (i.e. shorter than 1 second). To overcome these problems, in this paper, we present a reference-aware and duration-robust network (RaDur) for TSD. More specifically, in order to make the network more aware of the reference information, we propose an embedding enhancement module to take into account the mixture audio while generating the embedding, and apply the attention pooling to enhance the features of target sound-related frames and weaken the features of noisy frames. In addition, a duration-robust focal loss is proposed to help model different-duration events. To evaluate our method, we build two TSD datasets based on UrbanSound and Audioset. Extensive experiments show the effectiveness of our methods.
Submitted 5 April, 2022;
originally announced April 2022.
-
A Mixed supervised Learning Framework for Target Sound Detection
Authors:
Dongchao Yang,
Helin Wang,
Yuexian Zou,
Wenwu Wang
Abstract:
Target sound detection (TSD) aims to detect the target sound from mixture audio given the reference information. Previous works have shown that TSD models can be trained on fully-annotated (frame-level label) or weakly-annotated (clip-level label) data. However, there is clear evidence that the performance of the model trained on weakly-annotated data is worse than that trained on fully-annotated data. To fill this gap, we provide a mixed supervision perspective, in which novel categories (target domain) are learned using weak annotations with the help of full annotations of existing base categories (source domain). To realize this, a mixed supervised learning framework is proposed, which contains two mutually-helping student models (f_student and w_student) that learn from fully-annotated and weakly-annotated data, respectively. The motivation is that f_student, learned from fully-annotated data, has a better ability to capture detailed information than w_student. Thus, we first let f_student guide w_student to learn the ability to capture details, so w_student can perform better in the target domain. Then we let w_student guide f_student to fine-tune on the target domain. The process can be repeated several times so that the two students perform very well in the target domain. To evaluate our method, we built three TSD datasets based on UrbanSound and Audioset. Experimental results show that our methods offer about 8% improvement in event-based F-score as compared with a recent baseline.
Submitted 19 July, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Gan-Based Joint Activity Detection and Channel Estimation For Grant-free Random Access
Authors:
Shuang Liang,
Yinan Zou,
Yong Zhou
Abstract:
Joint activity detection and channel estimation (JADCE) for grant-free random access is a critical issue that needs to be addressed to support massive connectivity in IoT networks. However, the existing model-free learning method can only achieve either activity detection or channel estimation, but not both. In this paper, we propose a novel model-free learning method based on a generative adversarial network (GAN) to tackle the JADCE problem. We adopt the U-net architecture to build the generator rather than the standard GAN architecture, where a pre-estimated value that contains the activity information is adopted as input to the generator. By leveraging the properties of the pseudoinverse, the generator is refined by using an affine projection and a skip connection to ensure the output of the generator is consistent with the measurement. Moreover, we build a two-layer fully-connected neural network to design the pilot matrix for reducing the impact of receiver noise. Simulation results show that the proposed method outperforms the existing methods in high SNR regimes, as both data consistency projection and pilot matrix optimization improve the learning ability.
Submitted 4 April, 2022;
originally announced April 2022.
-
Estimating Fine-Grained Noise Model via Contrastive Learning
Authors:
Yunhao Zou,
Ying Fu
Abstract:
Image denoising has achieved unprecedented progress as great efforts have been made to exploit effective deep denoisers. To improve the denoising performance in the real world, two typical solutions are used in recent trends: devising better noise models for the synthesis of more realistic training data, and estimating the noise level function to guide non-blind denoisers. In this work, we combine both noise modeling and estimation, and propose an innovative noise model estimation and noise synthesis pipeline for realistic noisy image generation. Specifically, our model learns a noise estimation model with a fine-grained statistical noise model in a contrastive manner. Then, we use the estimated noise parameters to model the camera-specific noise distribution, and synthesize realistic noisy training data. The most striking aspect of our work is that by calibrating noise models of several sensors, our model can be extended to predict other cameras. In other words, we can estimate camera-specific noise models for unknown sensors with only testing images, without laborious calibration frames or paired noisy/clean data. The proposed pipeline endows deep denoisers with performance competitive with state-of-the-art real noise modeling methods.
Submitted 2 April, 2022;
originally announced April 2022.
-
Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches
Authors:
Zifeng Zhao,
Dongchao Yang,
Rongzhi Gu,
Haoran Zhang,
Yuexian Zou
Abstract:
Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance information may confuse the separation network and hence lead to wrong extraction results, which deteriorates the overall performance. We refer to this as the target confusion problem. In this paper, we conduct an analysis of such an issue and solve it in two stages. In the training phase, we propose to integrate metric learning methods to improve the distinguishability of embeddings produced by the speaker encoder. For inference, a novel post-filtering strategy is designed to revise the wrong results. Specifically, we first identify these confusion samples by measuring the similarities between output estimates and enrollment utterances, after which the true target sources are recovered by a subtraction operation. Experiments show a performance improvement of more than 1 dB SI-SDRi, which validates the effectiveness of our methods and emphasizes the impact of the target confusion problem.
Submitted 4 April, 2022;
originally announced April 2022.
-
Improving Target Sound Extraction with Timestamp Information
Authors:
Helin Wang,
Dongchao Yang,
Chao Weng,
Jianwei Yu,
Yuexian Zou
Abstract:
Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events.
Previous works mainly focus on the problems of weakly-labelled data, joint learning, and new classes; however, the onset and offset times of the target sound event, which are emphasized in auditory scene analysis, have been overlooked. In this paper, we study how to utilize such timestamp information to help extract the target sound via a target sound detection network and a target-weighted time-frequency loss function.
More specifically, we use the detection result of a target sound detection (TSD) network as the additional information to guide the learning of target sound extraction network. We also find that the result of TSE can further improve the performance of the TSD network, so that a mutual learning framework of the target sound detection and extraction is proposed. In addition, a target-weighted time-frequency loss function is designed to pay more attention to the temporal regions of the target sound during training. Experimental results on the synthesized data generated from the Freesound Datasets show that our proposed method can significantly improve the performance of TSE.
Submitted 2 April, 2022;
originally announced April 2022.
-
Learning Decoupling Features Through Orthogonality Regularization
Authors:
Li Wang,
Rongzhi Gu,
Weiji Zhuang,
Peng Gao,
Yujun Wang,
Yuexian Zou
Abstract:
Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that state-of-the-art KWS and SV models are trained independently using different datasets since they are expected to learn distinctive acoustic features. However, humans can distinguish language content and the speaker identity simultaneously. Motivated by this, we believe it is important to explore a method that can effectively extract common features while decoupling task-specific features. Bearing this in mind, a two-branch deep network (KWS branch and SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively. Experiments are conducted on the Google Speech Commands Dataset (GSCD). The results demonstrate that the orthogonality regularization helps the network to achieve SOTA EER of 1.31% and 1.87% on KWS and SV, respectively.
Submitted 30 March, 2022;
originally announced March 2022.
-
Integrating Lattice-Free MMI into End-to-End Speech Recognition
Authors:
**chuan Tian,
Jianwei Yu,
Chao Weng,
Yuexian Zou,
Dong Yu
Abstract:
In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds. To this end, novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method: maintains the consistency between training and decoding; eschews the on-the-fly decoding process; trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and datasets from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.
Submitted 22 August, 2022; v1 submitted 29 March, 2022;
originally announced March 2022.
-
Knowledge-Guided Learning for Transceiver Design in Over-the-Air Federated Learning
Authors:
Yinan Zou,
Zixin Wang,
Xu Chen,
Haibo Zhou,
Yong Zhou
Abstract:
In this paper, we consider communication-efficient over-the-air federated learning (FL), where multiple edge devices with non-independent and identically distributed datasets perform multiple local iterations in each communication round and then concurrently transmit their updated gradients to an edge server over the same radio channel for global model aggregation using over-the-air computation (AirComp). We derive the upper bound of the time-average norm of the gradients to characterize the convergence of AirComp-assisted FL, which reveals the impact of the model aggregation errors accumulated over all communication rounds on convergence. Based on the convergence analysis, we formulate an optimization problem to minimize the upper bound to enhance the learning performance, followed by proposing an alternating optimization algorithm to facilitate the optimal transceiver design for AirComp-assisted FL. As the alternating optimization algorithm suffers from high computational complexity, we further develop a knowledge-guided learning algorithm that exploits the structure of the analytic expression of the optimal transmit power to achieve computation-efficient transceiver design. Simulation results demonstrate that the proposed knowledge-guided learning algorithm achieves performance comparable to that of the alternating optimization algorithm, but with much lower computational complexity. Moreover, both proposed algorithms outperform the baseline methods in terms of convergence speed and test accuracy.
Submitted 28 March, 2022;
originally announced March 2022.
-
Analysis of lane-change conflict between cars and trucks at merging section using UAV video data
Authors:
Yichen Lu,
Kai Cheng,
Yue Zhang,
Xinqiang Chen,
Yajie Zou
Abstract:
The freeway on-ramp merging section is often identified as a crash-prone spot due to the high frequency of traffic conflicts. Very few traffic conflict analysis studies comprehensively consider different vehicle types at freeway merging sections. Thus, the main objective of this study is to analyse conflicts between different vehicle types at freeway merging sections. Field data are collected by Unmanned Aerial Vehicle (UAV) at merging areas in Shanghai, China. A vehicle extraction method is utilized to obtain vehicle trajectories. Time-to-collision (TTC) is utilized as the surrogate safety measure. The TTC of car-car conflicts is the smallest, while the TTC of truck-truck conflicts is the largest. Traffic conflicts frequently occur at the on-ramp and acceleration lane. Results show that the spatial distribution of lane-change conflicts differs significantly between vehicle types, suggesting that drivers, especially car drivers, should maintain a safe distance. Besides, in order to decrease lane-change conflicts at merging areas, traffic management agencies are advised to change the dotted line to a solid line at the beginning of the acceleration lane.
Submitted 5 January, 2022;
originally announced January 2022.
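As an illustration of the surrogate safety measure named in the abstract above, the following is a minimal sketch of the common car-following definition of time-to-collision: the gap between two vehicles divided by their closing speed. The function name, parameters, and units are assumptions made for illustration and are not taken from the paper, which may compute TTC differently (for example, from two-dimensional trajectories).

```python
import math


def time_to_collision(gap_m: float, follower_speed_mps: float, leader_speed_mps: float) -> float:
    """Return the time-to-collision in seconds for a follower/leader pair.

    Hypothetical helper: gap_m is the bumper-to-bumper distance in metres,
    and both speeds are in metres per second along the same lane direction.
    """
    if gap_m < 0:
        raise ValueError("gap_m must be non-negative")
    closing_speed = follower_speed_mps - leader_speed_mps
    # If the follower is not faster than the leader, the gap never closes.
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


if __name__ == "__main__":
    # A follower at 25 m/s that is 20 m behind a leader at 20 m/s.
    print(time_to_collision(20.0, 25.0, 20.0))
```

A smaller value indicates a more severe conflict, which is the sense in which the abstract compares car-car and truck-truck conflicts.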