Search | arXiv e-print repository

Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang

Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an… ▽ More Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training. △ Less

Submitted 25 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: We update this paper to an earlier paper

arXiv:2406.02167 [pdf, other]

ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Junjie Li

Abstract: Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion… ▽ More Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion demonstrates sub-optimal performance in short-duration speaker verification. To further improve the short-duration feature extraction capability of ERes2Net, we expand the channel dimension within each stage. However, this modification also increases the number of model parameters and computational complexity. To alleviate this problem, we propose an improved ERes2NetV2 by pruning redundant structures, ultimately reducing both the model parameters and its computational cost. A range of experiments conducted on the VoxCeleb datasets exhibits the superiority of ERes2NetV2, which achieves EER of 0.61% for the full-duration trial, 0.98% for the 3s-duration trial, and 1.48% for the 2s-duration trial on VoxCeleb1-O, respectively. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.18435 [pdf, other]

QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Authors: Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basant, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Ali, Bhakti Baheti, Yingbin Bai, Ishaan Bhatt, Sabri Can Cetindag , et al. (55 additional authors not shown)

Abstract: Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the de… ▽ More Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks. △ Less

Submitted 24 June, 2024; v1 submitted 19 March, 2024; originally announced May 2024.

Comments: initial technical report

arXiv:2405.14210 [pdf, other]

Eidos: Efficient, Imperceptible Adversarial 3D Point Clouds

Authors: Hanwei Zhang, Luo Cheng, Qisong He, Wei Huang, Renjue Li, Ronan Sicre, Xiaowei Huang, Holger Hermanns, Lijun Zhang

Abstract: Classification of 3D point clouds is a challenging machine learning (ML) task with important real-world applications in a spectrum from autonomous driving and robot-assisted surgery to earth observation from low orbit. As with other ML tasks, classification models are notoriously brittle in the presence of adversarial attacks. These are rooted in imperceptible changes to inputs with the effect tha… ▽ More Classification of 3D point clouds is a challenging machine learning (ML) task with important real-world applications in a spectrum from autonomous driving and robot-assisted surgery to earth observation from low orbit. As with other ML tasks, classification models are notoriously brittle in the presence of adversarial attacks. These are rooted in imperceptible changes to inputs with the effect that a seemingly well-trained model ends up misclassifying the input. This paper adds to the understanding of adversarial attacks by presenting Eidos, a framework providing Efficient Imperceptible aDversarial attacks on 3D pOint cloudS. Eidos supports a diverse set of imperceptibility metrics. It employs an iterative, two-step procedure to identify optimal adversarial examples, thereby enabling a runtime-imperceptibility trade-off. We provide empirical evidence relative to several popular 3D point cloud classification models and several established 3D attack methods, showing Eidos' superiority with respect to efficiency as well as imperceptibility. △ Less

Submitted 23 May, 2024; originally announced May 2024.

Comments: Preprint

arXiv:2403.19971 [pdf, other]

3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Tinglong Zhu, Changhe Song, Rongjie Huang, Ziyang Ma, Qian Chen, Shiliang Zhang, Xihao Li

Abstract: This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker verification and diarization. It is designed for the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acous… ▽ More This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker verification and diarization. It is designed for the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced language models to apprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. Finally, the visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker environments. Collectively, these modules empower the 3D-Speaker-Toolkit to attain elevated levels of accuracy and dependability in executing speaker-related tasks, establishing a new benchmark in multi-modal speaker analysis. The 3D-Speaker project also includes a handful of open-sourced state-of-the-art models and a large dataset containing over 10,000 speakers. The toolkit is publicly available at https://github.com/alibaba-damo-academy/3D-Speaker. △ Less

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.17701

Rotate to Scan: UNet-like Mamba with Triplet SSM Module for Medical Image Segmentation

Authors: Hao Tang, Lianglun Cheng, Guoheng Huang, Zhengguang Tan, Junhao Lu, Kaihong Wu

Abstract: Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its var… ▽ More Image segmentation holds a vital position in the realms of diagnosis and treatment within the medical domain. Traditional convolutional neural networks (CNNs) and Transformer models have made significant advancements in this realm, but they still encounter challenges because of limited receptive field or high computing complexity. Recently, State Space Models (SSMs), particularly Mamba and its variants, have demonstrated notable performance in the field of vision. However, their feature extraction methods may not be sufficiently effective and retain some redundant structures, leaving room for parameter reduction. Motivated by previous spatial and channel attention methods, we propose Triplet Mamba-UNet. The method leverages residual VSS Blocks to extract intensive contextual features, while Triplet SSM is employed to fuse features across spatial and channel dimensions. We conducted experiments on ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, and Kvasir-Instrument datasets, demonstrating the superior segmentation performance of our proposed TM-UNet. Additionally, compared to the previous VM-UNet, our model achieves a one-third reduction in parameters. △ Less

Submitted 3 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Experimental method encountered errors, undergoing experiment again

arXiv:2401.16775 [pdf, other]

doi 10.1109/TSP.2024.3361090

Activity Detection for Massive Connectivity in Cell-free Networks with Unknown Large-scale Fading, Channel Statistics, Noise Variance, and Activity Probability: A Bayesian Approach

Authors: Hao Zhang, Qingfeng Lin, Yang Li, Lei Cheng, Yik-Chung Wu

Abstract: Activity detection is an important task in the next generation grant-free multiple access. While there are a number of existing algorithms designed for this purpose, they mostly require precise information about the network, such as large-scale fading coefficients, small-scale fading channel statistics, noise variance at the access points, and user activity probability. Acquiring these information… ▽ More Activity detection is an important task in the next generation grant-free multiple access. While there are a number of existing algorithms designed for this purpose, they mostly require precise information about the network, such as large-scale fading coefficients, small-scale fading channel statistics, noise variance at the access points, and user activity probability. Acquiring these information would take a significant overhead and their estimated values might not be accurate. This problem is even more severe in cell-free networks as there are many of these parameters to be acquired. Therefore, this paper sets out to investigate the activity detection problem without the above-mentioned information. In order to handle so many unknown parameters, this paper employs the Bayesian approach, where the unknown variables are endowed with prior distributions which effectively act as regularizations. Together with the likelihood function, a maximum a posteriori (MAP) estimator and a variational inference algorithm are derived. Extensive simulations demonstrate that the proposed methods, even without the knowledge of these system parameters, perform better than existing state-of-the-art methods, such as covariance-based and approximate message passing methods. △ Less

Submitted 2 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: 16 pages, 9 figures, accepted for publication in IEEE Transactions on Signal Processing

MSC Class: 68T01

arXiv:2401.01738 [pdf, other]

doi 10.1109/TWC.2024.3351856

Integrated Sensing and Communication with Massive MIMO: A Unified Tensor Approach for Channel and Target Parameter Estimation

Authors: Ruoyu Zhang, Lei Cheng, Shuai Wang, Yi Lou, Yulong Gao, Wen Wu, Derrick Wing Kwan Ng

Abstract: Benefitting from the vast spatial degrees of freedom, the amalgamation of integrated sensing and communication (ISAC) and massive multiple-input multiple-output (MIMO) is expected to simultaneously improve spectral and energy efficiencies as well as the sensing capability. However, a large number of antennas deployed in massive MIMO-ISAC raises critical challenges in acquiring both accurate channe… ▽ More Benefitting from the vast spatial degrees of freedom, the amalgamation of integrated sensing and communication (ISAC) and massive multiple-input multiple-output (MIMO) is expected to simultaneously improve spectral and energy efficiencies as well as the sensing capability. However, a large number of antennas deployed in massive MIMO-ISAC raises critical challenges in acquiring both accurate channel state information and target parameter information. To overcome these two challenges with a unified framework, we first analyze their underlying system models and then propose a novel tensor-based approach that addresses both the channel estimation and target sensing problems. Specifically, by parameterizing the high-dimensional communication channel exploiting a small number of physical parameters, we associate the channel state information with the sensing parameters of targets in terms of angular, delay, and Doppler dimensions. Then, we propose a shared training pattern adopting the same time-frequency resources such that both the channel estimation and target parameter estimation can be formulated as a canonical polyadic decomposition problem with a similar mathematical expression. On this basis, we first investigate the uniqueness condition of the tensor factorization and the maximum number of resolvable targets by utilizing the specific Vandermonde △ Less

Submitted 3 January, 2024; originally announced January 2024.

Journal ref: IEEE Transactions on Wireless Communications, 2024

arXiv:2311.05929 [pdf, other]

Efficient Segmentation with Texture in Ore Images Based on Box-supervised Approach

Authors: Guodong Sun, Delong Huang, Yuting Peng, Le Cheng, Bo Wu, Yang Zhang

Abstract: Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is prov… ▽ More Image segmentation methods have been utilized to determine the particle size distribution of crushed ores. Due to the complex working environment, high-powered computing equipment is difficult to deploy. At the same time, the ore distribution is stacked, and it is difficult to identify the complete features. To address this issue, an effective box-supervised technique with texture features is provided for ore image segmentation that can identify complete and independent ores. Firstly, a ghost feature pyramid network (Ghost-FPN) is proposed to process the features obtained from the backbone to reduce redundant semantic information and computation generated by complex networks. Then, an optimized detection head is proposed to obtain the feature to maintain accuracy. Finally, Lab color space (Lab) and local binary patterns (LBP) texture features are combined to form a fusion feature similarity-based loss function to improve accuracy while incurring no loss. Experiments on MS COCO have shown that the proposed fusion features are also worth studying on other types of datasets. Extensive experimental results demonstrate the effectiveness of the proposed method, which achieves over 50 frames per second with a small model size of 21.6 MB. Meanwhile, the method maintains a high level of accuracy compared with the state-of-the-art approaches on ore image dataset. The source code is available at \url{https://github.com/MVME-HBUT/OREINST}. △ Less

Submitted 10 November, 2023; originally announced November 2023.

Comments: 14 pages, 8 figures

arXiv:2309.10456 [pdf, other]

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Authors: Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang

Abstract: Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit th… ▽ More Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit these semantic cues utilizing language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. Firstly, we introduce spoken language understanding modules to extract speaker-related semantic information and utilize these information to construct pairwise constraints. Secondly, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on the public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems. △ Less

Submitted 4 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

arXiv:2309.00787 [pdf, other]

Online Targetless Radar-Camera Extrinsic Calibration Based on the Common Features of Radar and Camera

Authors: Lei Cheng, Siyang Cao

Abstract: Sensor fusion is essential for autonomous driving and autonomous robots, and radar-camera fusion systems have gained popularity due to their complementary sensing capabilities. However, accurate calibration between these two sensors is crucial to ensure effective fusion and improve overall system performance. Calibration involves intrinsic and extrinsic calibration, with the latter being particula… ▽ More Sensor fusion is essential for autonomous driving and autonomous robots, and radar-camera fusion systems have gained popularity due to their complementary sensing capabilities. However, accurate calibration between these two sensors is crucial to ensure effective fusion and improve overall system performance. Calibration involves intrinsic and extrinsic calibration, with the latter being particularly important for achieving accurate sensor fusion. Unfortunately, many target-based calibration methods require complex operating procedures and well-designed experimental conditions, posing challenges for researchers attempting to reproduce the results. To address this issue, we introduce a novel approach that leverages deep learning to extract a common feature from raw radar data (i.e., Range-Doppler-Angle data) and camera images. Instead of explicitly representing these common features, our method implicitly utilizes these common features to match identical objects from both data sources. Specifically, the extracted common feature serves as an example to demonstrate an online targetless calibration method between the radar and camera systems. The estimation of the extrinsic transformation matrix is achieved through this feature-based approach. To enhance the accuracy and robustness of the calibration, we apply the RANSAC and Levenberg-Marquardt (LM) nonlinear optimization algorithm for deriving the matrix. Our experiments in the real world demonstrate the effectiveness and accuracy of our proposed method. △ Less

Submitted 24 January, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

arXiv:2308.04930 [pdf, other]

Striking The Right Balance: Three-Dimensional Ocean Sound Speed Field Reconstruction Using Tensor Neural Networks

Authors: Siyuan Li, Lei Cheng, Ting Zhang, Hangfang Zhao, Jianlong Li

Abstract: Accurately reconstructing a three-dimensional ocean sound speed field (3D SSF) is essential for various ocean acoustic applications, but the sparsity and uncertainty of sound speed samples across a vast ocean region make it a challenging task. To tackle this challenge, a large body of reconstruction methods has been developed, including spline interpolation, matrix/tensor-based completion, and dee… ▽ More Accurately reconstructing a three-dimensional ocean sound speed field (3D SSF) is essential for various ocean acoustic applications, but the sparsity and uncertainty of sound speed samples across a vast ocean region make it a challenging task. To tackle this challenge, a large body of reconstruction methods has been developed, including spline interpolation, matrix/tensor-based completion, and deep neural networks-based reconstruction. However, a principled analysis of their effectiveness in 3D SSF reconstruction is still lacking. This paper performs a thorough analysis of the reconstruction error and highlights the need for a balanced representation model that integrates both expressiveness and conciseness. To meet this requirement, a 3D SSF-tailored tensor deep neural network is proposed, which utilizes tensor computations and deep neural network architectures to achieve remarkable 3D SSF reconstruction. The proposed model not only includes the previous tensor-based SSF representation model as a special case, but also has a natural ability to reject noise. The numerical results using the South China Sea 3D SSF data demonstrate that the proposed method outperforms state-of-the-art methods. The code is available at https://github.com/OceanSTARLab/Tensor-Neural-Network. △ Less

Submitted 9 August, 2023; originally announced August 2023.

arXiv:2308.02774 [pdf, other]

Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang

Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an… ▽ More Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training. Ablation studies show that both proposed learnable prototypes in self-distillation network and diversity regularization contribute to the verification performance. △ Less

Submitted 26 June, 2024; v1 submitted 4 August, 2023; originally announced August 2023.

Comments: arXiv admin note: text overlap with arXiv:2211.04168 I submitted the updated paper for arXiv:2308.02774 with the revised version. As for arXiv:2406.11169, I mistakenly submitted this last time, so I withdrew arXiv:2406.11169 and merged the latest content into arXiv: 2308.02774

arXiv:2307.15264 [pdf, other]

3D Radar and Camera Co-Calibration: A Flexible and Accurate Method for Target-based Extrinsic Calibration

Authors: Lei Cheng, Arindam Sengupta, Siyang Cao

Abstract: Advances in autonomous driving are inseparable from sensor fusion. Heterogeneous sensors are widely used for sensor fusion due to their complementary properties, with radar and camera being the most equipped sensors. Intrinsic and extrinsic calibration are essential steps in sensor fusion. The extrinsic calibration, independent of the sensor's own parameters, and performed after the sensors are in… ▽ More Advances in autonomous driving are inseparable from sensor fusion. Heterogeneous sensors are widely used for sensor fusion due to their complementary properties, with radar and camera being the most equipped sensors. Intrinsic and extrinsic calibration are essential steps in sensor fusion. The extrinsic calibration, independent of the sensor's own parameters, and performed after the sensors are installed, greatly determines the accuracy of sensor fusion. Many target-based methods require cumbersome operating procedures and well-designed experimental conditions, making them extremely challenging. To this end, we propose a flexible, easy-to-reproduce and accurate method for extrinsic calibration of 3D radar and camera. The proposed method does not require a specially designed calibration environment, and instead places a single corner reflector (CR) on the ground to iteratively collect radar and camera data simultaneously using Robot Operating System (ROS), and obtain radar-camera point correspondences based on their timestamps, and then use these point correspondences as input to solve the perspective-n-point (PnP) problem, and finally get the extrinsic calibration matrix. Also, RANSAC is used for robustness and the Levenberg-Marquardt (LM) nonlinear optimization algorithm is used for accuracy. Multiple controlled environment experiments as well as real-world experiments demonstrate the efficiency and accuracy (AED error is 15.31 pixels and Acc up to 89\%) of the proposed method. △ Less

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.02113 [pdf]

doi 10.1109/LSP.2023.3295750

Multipath Time-delay Estimation with Impulsive Noise via Bayesian Compressive Sensing

Authors: Xingyu Ji, Lei Cheng, Hangfang Zhao

Abstract: Multipath time-delay estimation is commonly encountered in radar and sonar signal processing. In some real-life environments, impulse noise is ubiquitous and significantly degrades estimation performance. Here, we propose a Bayesian approach to tailor the Bayesian Compressive Sensing (BCS) to mitigate impulsive noises. In particular, a heavy-tail Laplacian distribution is used as a statistical mod… ▽ More Multipath time-delay estimation is commonly encountered in radar and sonar signal processing. In some real-life environments, impulse noise is ubiquitous and significantly degrades estimation performance. Here, we propose a Bayesian approach to tailor the Bayesian Compressive Sensing (BCS) to mitigate impulsive noises. In particular, a heavy-tail Laplacian distribution is used as a statistical model for impulse noise, while Laplacian prior is used for sparse multipath modeling. The Bayesian learning problem contains hyperparameters learning and parameter estimation, solved under the BCS inference framework. The performance of our proposed method is compared with benchmark methods, including compressive sensing (CS), BCS, and Laplacian-prior BCS (L-BCS). The simulation results show that our proposed method can estimate the multipath parameters more accurately and have a lower root mean squared estimation error (RMSE) in intensely impulsive noise. △ Less

Submitted 5 July, 2023; originally announced July 2023.

arXiv:2306.15354 [pdf, other]

3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Authors: Siqi Zheng, Luyao Cheng, Yafeng Chen, Hui Wang, Qian Chen

Abstract: Disentangling uncorrelated information in speech utterances is a crucial research topic within speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the affects of other uncorrelated information. We present a large-scale speech corpus to facilitate the research of speech representation disentanglement. 3D-Speaker contains over 10,000… ▽ More Disentangling uncorrelated information in speech utterances is a crucial research topic within speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the affects of other uncorrelated information. We present a large-scale speech corpus to facilitate the research of speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom are simultaneously recorded by multiple Devices, locating at different Distances, and some speakers are speaking multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of a diverse blend of speech representation entanglement, thereby motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource to evaluate large universal speech models and experiment methods of out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/ △ Less

Submitted 24 September, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.11149 [pdf, other]

Overcoming Beam Squint in Dual-Wideband mmWave MIMO Channel Estimation: A Bayesian Multi-Band Sparsity Approach

Authors: Le Xu, Lei Cheng, Ngai Wong, Yik-Chung Wu, H. Vincent Poor

Abstract: The beam squint effect, which manifests in different steering matrices in different sub-bands, has been widely considered a challenge in millimeter wave (mmWave) multiinput multi-output (MIMO) channel estimation. Existing methods either require specific forms of the precoding/combining matrix, which restrict their general practicality, or simply ignore the beam squint effect by only making use of… ▽ More The beam squint effect, which manifests in different steering matrices in different sub-bands, has been widely considered a challenge in millimeter wave (mmWave) multiinput multi-output (MIMO) channel estimation. Existing methods either require specific forms of the precoding/combining matrix, which restrict their general practicality, or simply ignore the beam squint effect by only making use of a single sub-band for channel estimation. Recognizing that different steering matrices are coupled by the same set of unknown channel parameters, this paper proposes to exploit the common sparsity structure of the virtual channel model so that signals from different subbands can be jointly utilized to enhance the performance of channel estimation. A probabilistic model is built to induce the common sparsity in the spatial domain, and the first-order Taylor expansion is adopted to get rid of the grid mismatch in the dictionaries. To learn the model parameters, a variational expectation-maximization (EM) algorithm is derived, which automatically obtains the balance between the likelihood function and the common sparsity prior information, and is applicable to arbitrary forms of precoding/combining matrices. Simulation results show the superior estimation accuracy of the proposed algorithm over existing methods under different noise powers and system configurations. △ Less

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2306.11123 [pdf, other]

To Fold or Not to Fold: Graph Regularized Tensor Train for Visual Data Completion

Authors: Le Xu, Lei Cheng, Ngai Wong, Yik-Chung Wu

Abstract: Tensor train (TT) representation has achieved tremendous success in visual data completion tasks, especially when it is combined with tensor folding. However, folding an image or video tensor breaks the original data structure, leading to local information loss as nearby pixels may be assigned into different dimensions and become far away from each other. In this paper, to fully preserve the local… ▽ More Tensor train (TT) representation has achieved tremendous success in visual data completion tasks, especially when it is combined with tensor folding. However, folding an image or video tensor breaks the original data structure, leading to local information loss as nearby pixels may be assigned into different dimensions and become far away from each other. In this paper, to fully preserve the local information of the original visual data, we explore not folding the data tensor, and at the same time adopt graph information to regularize local similarity between nearby entries. To overcome the high computational complexity introduced by the graph-based regularization in the TT completion problem, we propose to break the original problem into multiple sub-problems with respect to each TT core fiber, instead of each TT core as in traditional methods. Furthermore, to avoid heavy parameter tuning, a sparsity promoting probabilistic model is built based on the generalized inverse Gaussian (GIG) prior, and an inference algorithm is derived under the mean-field approximation. Experiments on both synthetic data and real-world visual data show the superiority of the proposed methods. △ Less

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2305.12927 [pdf, other]

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

Authors: Luyao Cheng, Siqi Zheng, Zhang Qinglin, Hui Wang, Yafeng Chen, Qian Chen

Abstract: Speaker diarization(SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider acoustic information only, which result in performance degradation when encountering adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from semantic c… ▽ More Speaker diarization(SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider acoustic information only, which result in performance degradation when encountering adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from semantic content in multi-party meetings, which, as we will show, can further benefit speaker diarization. We introduce two sub-tasks, Dialogue Detection and Speaker-Turn Detection, in which we effectively extract speaker information from conversational semantics. We also propose a simple yet effective algorithm to jointly model acoustic and semantic information and obtain speaker-identified texts. Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems. △ Less

Submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to Findings of ACL 2023

arXiv:2305.12838 [pdf, other]

An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Jiajun Qi

Abstract: Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the… ▽ More Effective fusion of multi-scale features is crucial for improving speaker verification performance. While most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion (LFF) fuses the features within one single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker. △ Less

Submitted 3 August, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2305.01183 [pdf, other]

Faster OreFSDet : A Lightweight and Effective Few-shot Object Detector for Ore Images

Authors: Yang Zhang, Le Cheng, Yuting Peng, Chengming Xu, Yanwei Fu, Bo Wu, Guodong Sun

Abstract: For the ore particle size detection, obtaining a sizable amount of high-quality ore labeled data is time-consuming and expensive. General object detection methods often suffer from severe over-fitting with scarce labeled data. Despite their ability to eliminate over-fitting, existing few-shot object detectors encounter drawbacks such as slow detection speed and high memory requirements, making the… ▽ More For the ore particle size detection, obtaining a sizable amount of high-quality ore labeled data is time-consuming and expensive. General object detection methods often suffer from severe over-fitting with scarce labeled data. Despite their ability to eliminate over-fitting, existing few-shot object detectors encounter drawbacks such as slow detection speed and high memory requirements, making them difficult to implement in a real-world deployment scenario. To this end, we propose a lightweight and effective few-shot detector to achieve competitive performance with general object detection with only a few samples for ore images. First, the proposed support feature mining block characterizes the importance of location information in support features. Next, the relationship guidance block makes full use of support features to guide the generation of accurate candidate proposals. Finally, the dual-scale semantic aggregation module retrieves detailed features at different resolutions to contribute with the prediction process. Experimental results show that our method consistently exceeds the few-shot detectors with an excellent performance gap on all metrics. Moreover, our method achieves the smallest model size of 19MB as well as being competitive at 50 FPS detection speed compared with general object detectors. The source code is available at https://github.com/MVME-HBUT/Faster-OreFSDet. △ Less

Submitted 1 May, 2023; originally announced May 2023.

Comments: 18 pages, 11 figures

arXiv:2304.01064 [pdf, other]

Real-time 6K Image Rescaling with Rate-distortion Optimization

Authors: Chenyang Qi, Xin Yang, Ka Leong Cheng, Ying-Cong Chen, Qifeng Chen

Abstract: Contemporary image rescaling aims at embedding a high-resolution (HR) image into a low-resolution (LR) thumbnail image that contains embedded information for HR image reconstruction. Unlike traditional image super-resolution, this enables high-fidelity HR image restoration faithful to the original one, given the embedded information in the LR thumbnail. However, state-of-the-art image rescaling me… ▽ More Contemporary image rescaling aims at embedding a high-resolution (HR) image into a low-resolution (LR) thumbnail image that contains embedded information for HR image reconstruction. Unlike traditional image super-resolution, this enables high-fidelity HR image restoration faithful to the original one, given the embedded information in the LR thumbnail. However, state-of-the-art image rescaling methods do not optimize the LR image file size for efficient sharing and fall short of real-time performance for ultra-high-resolution (e.g., 6K) image reconstruction. To address these two challenges, we propose a novel framework (HyperThumbnail) for real-time 6K rate-distortion-aware image rescaling. Our framework first embeds an HR image into a JPEG LR thumbnail by an encoder with our proposed quantization prediction module, which minimizes the file size of the embedding LR JPEG thumbnail while maximizing HR reconstruction quality. Then, an efficient frequency-aware decoder reconstructs a high-fidelity HR image from the LR one in real time. Extensive experiments demonstrate that our framework outperforms previous image rescaling baselines in rate-distortion performance and can perform 6K image reconstruction in real time. △ Less

Submitted 19 May, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: Accepted by CVPR 2023; Github Repository: https://github.com/AbnerVictor/HyperThumbnail

arXiv:2303.09560 [pdf]

Methodology for Capacity Credit Evaluation of Physical and Virtual Energy Storage in Decarbonized Power System

Authors: Ning Qi, Peng Li, Lin Cheng, Ziyi Zhang, Wenrui Huang, Weiwei Yang

Abstract: Energy storage (ES) and virtual energy storage (VES) are key components to realizing power system decarbonization. Although ES and VES have been proven to deliver various types of grid services, little work has so far provided a systematical framework for quantifying their adequacy contribution and credible capacity value while incorporating human and market behavior. Therefore, this manuscript pr… ▽ More Energy storage (ES) and virtual energy storage (VES) are key components to realizing power system decarbonization. Although ES and VES have been proven to deliver various types of grid services, little work has so far provided a systematical framework for quantifying their adequacy contribution and credible capacity value while incorporating human and market behavior. Therefore, this manuscript proposed a novel evaluation framework to evaluate the capacity credit (CC) of ES and VES. To address the system capacity inadequacy and market behavior of storage, a two-stage coordinated dispatch is proposed to achieve the trade-off between day-ahead self-energy management of resources and efficient adjustment to real-time failures. And we further modeled the human behavior with storage operations and incorporate two types of decision-independent uncertainties (DIUs) (operate state and self-consumption) and one type of decision-dependent uncertainty (DDUs) (available capacity) into the proposed dispatch. Furthermore, novel reliability and CC indices (e.g., equivalent physical storage capacity (EPSC)) are introduced to evaluate the practical and theoretical adequacy contribution of ES and VES, as well as the ability to displace generation and physical storage while maintaining equivalent system adequacy. Exhaustive case studies based on the IEEE RTS-79 system and real-world data verify the significant consequence (10%-70% overestimated CC) of overlooking DIUs and DDUs in the previous works, while the proposed method outperforms other and can generate a credible and realistic result. Finally, we investigate key factors affecting the adequacy contribution of ES and VES, and reasonable suggestions are provided for better flexibility utilization of ES and VES in decarbonized power system. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: capacity credit, decision-dependent uncertainty, decarbonized power system

arXiv:2303.06340 [pdf, other]

Intelligent diagnostic scheme for lung cancer screening with Raman spectra data by tensor network machine learning

Authors: Yu-Jia An, Sheng-Chen Bai, Lin Cheng, Xiao-Guang Li, Cheng-en Wang, Xiao-Dong Han, Gang Su, Shi-Ju Ran, Cong Wang

Abstract: Artificial intelligence (AI) has brought tremendous impacts on biomedical sciences from academic researches to clinical applications, such as in biomarkers' detection and diagnosis, optimization of treatment, and identification of new therapeutic targets in drug discovery. However, the contemporary AI technologies, particularly deep machine learning (ML), severely suffer from non-interpretability,… ▽ More Artificial intelligence (AI) has brought tremendous impacts on biomedical sciences from academic researches to clinical applications, such as in biomarkers' detection and diagnosis, optimization of treatment, and identification of new therapeutic targets in drug discovery. However, the contemporary AI technologies, particularly deep machine learning (ML), severely suffer from non-interpretability, which might uncontrollably lead to incorrect predictions. Interpretability is particularly crucial to ML for clinical diagnosis as the consumers must gain necessary sense of security and trust from firm grounds or convincing interpretations. In this work, we propose a tensor-network (TN)-ML method to reliably predict lung cancer patients and their stages via screening Raman spectra data of Volatile organic compounds (VOCs) in exhaled breath, which are generally suitable as biomarkers and are considered to be an ideal way for non-invasive lung cancer screening. The prediction of TN-ML is based on the mutual distances of the breath samples mapped to the quantum Hilbert space. Thanks to the quantum probabilistic interpretation, the certainty of the predictions can be quantitatively characterized. The accuracy of the samples with high certainty is almost 100$\%$. The incorrectly-classified samples exhibit obviously lower certainty, and thus can be decipherably identified as anomalies, which will be handled by human experts to guarantee high reliability. Our work sheds light on shifting the ``AI for biomedical sciences'' from the conventional non-interpretable ML schemes to the interpretable human-ML interactive approaches, for the purpose of high accuracy and reliability. △ Less

Submitted 11 March, 2023; originally announced March 2023.

Comments: 10 pages, 7 figures

arXiv:2303.00332 [pdf, other]

CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

Authors: Hui Wang, Siqi Zheng, Yafeng Chen, Luyao Cheng, Qian Chen

Abstract: Time delay neural network (TDNN) has been proven to be efficient for speaker verification. One of its successful variants, ECAPA-TDNN, achieved state-of-the-art performance at the cost of much higher computational complexity and slower inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an arch… ▽ More Time delay neural network (TDNN) has been proven to be efficient for speaker verification. One of its successful variants, ECAPA-TDNN, achieved state-of-the-art performance at the cost of much higher computational complexity and slower inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an architecture that can achieve the performance of ECAPA-TDNN and the efficiency of vanilla TDNN. In this paper, we propose an efficient network based on context-aware masking, namely CAM++, which uses densely connected time delay neural network (D-TDNN) as backbone and adopts a novel multi-granularity pooling to capture contextual information at different levels. Extensive experiments on two public benchmarks, VoxCeleb and CN-Celeb, demonstrate that the proposed architecture outperforms other mainstream speaker verification systems with lower computational cost and faster inference speed. △ Less

Submitted 16 June, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2212.14739 [pdf]

Semantic optical fiber communication system

Authors: Zhenming Yu, Hongyu Huang, Liming Cheng, Wei Zhang, Yueqiu Mu, Kun Xu

Abstract: The current optical communication systems minimize bit or symbol errors without considering the semantic meaning behind digital bits, thus transmitting a lot of unnecessary information. We propose and experimentally demonstrate a semantic optical fiber communication (SOFC) system. Instead of encoding information into bits for transmission, semantic information is extracted from the source using de… ▽ More The current optical communication systems minimize bit or symbol errors without considering the semantic meaning behind digital bits, thus transmitting a lot of unnecessary information. We propose and experimentally demonstrate a semantic optical fiber communication (SOFC) system. Instead of encoding information into bits for transmission, semantic information is extracted from the source using deep learning. The generated semantic symbols are then directly transmitted through an optical fiber. Compared with the bit-based structure, the SOFC system achieved higher information compression and a more stable performance, especially in the low received optical power regime, and enhanced the robustness against optical link impairments. This work introduces an intelligent optical communication system at the human analytical thinking level, which is a significant step toward a breakthrough in the current optical communication architecture. △ Less

Submitted 27 December, 2022; originally announced December 2022.

arXiv:2212.07608 [pdf, other]

Output-Dependent Gaussian Process State-Space Model

Authors: Zhidi Lin, Lei Cheng, Feng Yin, Lexi Xu, Shuguang Cui

Abstract: Gaussian process state-space model (GPSSM) is a fully probabilistic state-space model that has attracted much attention over the past decade. However, the outputs of the transition function in the existing GPSSMs are assumed to be independent, meaning that the GPSSMs cannot exploit the inductive biases between different outputs and lose certain model capacities. To address this issue, this paper p… ▽ More Gaussian process state-space model (GPSSM) is a fully probabilistic state-space model that has attracted much attention over the past decade. However, the outputs of the transition function in the existing GPSSMs are assumed to be independent, meaning that the GPSSMs cannot exploit the inductive biases between different outputs and lose certain model capacities. To address this issue, this paper proposes an output-dependent and more realistic GPSSM by utilizing the well-known, simple yet practical linear model of coregionalization (LMC) framework to represent the output dependency. To jointly learn the output-dependent GPSSM and infer the latent states, we propose a variational sparse GP-based learning method that only gently increases the computational complexity. Experiments on both synthetic and real datasets demonstrate the superiority of the output-dependent GPSSM in terms of learning and inference performance. △ Less

Submitted 14 December, 2022; originally announced December 2022.

Comments: 5 pages, 4 figures

arXiv:2211.12220 [pdf, other]

A Scope Sensitive and Result Attentive Model for Multi-Intent Spoken Language Understanding

Authors: Lizhi Cheng, Wenmian Yang, Weijia Jia

Abstract: Multi-Intent Spoken Language Understanding (SLU), a novel and more complex scenario of SLU, is attracting increasing attention. Unlike traditional SLU, each intent in this scenario has its specific scope. Semantic information outside the scope even hinders the prediction, which tremendously increases the difficulty of intent detection. More seriously, guiding slot filling with these inaccurate int… ▽ More Multi-Intent Spoken Language Understanding (SLU), a novel and more complex scenario of SLU, is attracting increasing attention. Unlike traditional SLU, each intent in this scenario has its specific scope. Semantic information outside the scope even hinders the prediction, which tremendously increases the difficulty of intent detection. More seriously, guiding slot filling with these inaccurate intent labels suffers error propagation problems, resulting in unsatisfied overall performance. To solve these challenges, in this paper, we propose a novel Scope-Sensitive Result Attention Network (SSRAN) based on Transformer, which contains a Scope Recognizer (SR) and a Result Attention Network (RAN). Scope Recognizer assignments scope information to each token, reducing the distraction of out-of-scope tokens. Result Attention Network effectively utilizes the bidirectional interaction between results of slot filling and intent detection, mitigating the error propagation problem. Experiments on two public datasets indicate that our model significantly improves SLU performance (5.4\% and 2.1\% on Overall accuracy) over the state-of-the-art baseline. △ Less

Submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.04168 [pdf, other]

Pushing the limits of self-supervised speaker verification using regularized distillation framework

Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen

Abstract: Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One reg… ▽ More Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other regularization term decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques are explored, on both time and frequency domain. A range of experiments conducted on the VoxCeleb datasets demonstrate the superiority of the regularized DINO framework in speaker verification. Our method achieves the state-of-the-art speaker verification performance under a single-stage self-supervised setting on VoxCeleb. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker. △ Less

Submitted 2 August, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

arXiv:2207.10869 [pdf, other]

Optimizing Image Compression via Joint Learning with Denoising

Authors: Ka Leong Cheng, Yueqi Xie, Qifeng Chen

Abstract: High levels of noise usually exist in today's captured images due to the relatively small sensors equipped in the smartphone cameras, where the noise brings extra challenges to lossy image compression algorithms. Without the capacity to tell the difference between image details and noise, general image compression methods allocate additional bits to explicitly store the undesired image noise durin… ▽ More High levels of noise usually exist in today's captured images due to the relatively small sensors equipped in the smartphone cameras, where the noise brings extra challenges to lossy image compression algorithms. Without the capacity to tell the difference between image details and noise, general image compression methods allocate additional bits to explicitly store the undesired image noise during compression and restore the unpleasant noisy image during decompression. Based on the observations, we optimize the image compression algorithm to be noise-aware as joint denoising and compression to resolve the bits misallocation problem. The key is to transform the original noisy images to noise-free bits by eliminating the undesired noise during compression, where the bits are later decompressed as clean images. Specifically, we propose a novel two-branch, weight-sharing architecture with plug-in feature denoisers to allow a simple and effective realization of the goal with little computational cost. Experimental results show that our method gains a significant improvement over the existing baseline methods on both the synthetic and real-world datasets. Our source code is available at https://github.com/felixcheng97/DenoiseCompression. △ Less

Submitted 22 July, 2022; originally announced July 2022.

Comments: Accepted to ECCV 2022

arXiv:2207.10204 [pdf, ps, other]

Watermark-Based Code Construction for Finite-State Markov Channel with Synchronisation Errors

Authors: Shamin Achari, Ling Cheng

Abstract: With advancements in telecommunications, data transmission over increasingly harsher channels that produce synchronisation errors is inevitable. Coding schemes for such channels are available through techniques such as the Davey-MacKay watermark coding; however, this is limited to memoryless channel estimates. Memory must be accounted for to ensure a realistic channel approximation - similar to a… ▽ More With advancements in telecommunications, data transmission over increasingly harsher channels that produce synchronisation errors is inevitable. Coding schemes for such channels are available through techniques such as the Davey-MacKay watermark coding; however, this is limited to memoryless channel estimates. Memory must be accounted for to ensure a realistic channel approximation - similar to a Finite State Markov Chain or Fritchman Model. A novel code construction and decoder are developed to correct synchronisation errors while considering the channel's correlated memory effects by incorporating ideas from the watermark scheme and memory modelling. Simulation results show that the proposed code construction and decoder rival the first and second-order Davey-MacKay type watermark decoder and even perform slightly better when the inner-channel capacity is higher than 0.9. The proposed system and decoder may prove helpful in fields such as free-space optics and possibly molecular communication, where harsh channels are used for communication. △ Less

Submitted 20 July, 2022; originally announced July 2022.

Comments: Submitted to Elsevier Digital Signal Processing

arXiv:2206.05427 [pdf, ps, other]

Reconfigurable Intelligent Surface-Aided 6G Massive Access: Coupled Tensor Modeling and Sparse Bayesian Learning

Authors: Xiaodan Shao, Lei Cheng, Xiaoming Chen, Chongwen Huang, Derrick Wing Kwan Ng

Abstract: This paper investigates a reconfigurable intelligent surface (RIS)-aided unsourced random access (URA) scheme for the sixth-generation (6G) wireless networks with massive sporadic traffic devices. First of all, this paper proposes a novel joint active device separation (the message recovery of active device) and channel estimation architecture for the RIS-aided URA. Specifically, the RIS passive r… ▽ More This paper investigates a reconfigurable intelligent surface (RIS)-aided unsourced random access (URA) scheme for the sixth-generation (6G) wireless networks with massive sporadic traffic devices. First of all, this paper proposes a novel joint active device separation (the message recovery of active device) and channel estimation architecture for the RIS-aided URA. Specifically, the RIS passive reflection is optimized before the successful device separation. Then, by associating the data sequences to multiple rank-one tensors and exploiting the angular sparsity of the RIS-BS channel, the detection problem is cast as a high-order coupled tensor decomposition problem without the need of exploiting pilot sequences. However, the inherent coupling among multiple sparse device-RIS channels, together with the unknown number of active devices make the detection problem at hand deviate from the widely-used coupled tensor decomposition format. To overcome this challenge, this paper judiciously devises a probabilistic model that captures both the element-wise sparsity from the angular channel model and the low-rank property due to the sporadic nature of URA. Then, based on such a probabilistic model, a iterative detection algorithm is developed under the framework of sparse variational inference, where each update iteration is obtained in a closed-form and the number of active devices can be automatically estimated for effectively avoiding the overfitting of noise. Extensive simulation results confirm the excellence of the proposed URA algorithm, especially for the case of a large number of reflecting elements for accommodating a significantly large number of devices. △ Less

Submitted 11 June, 2022; originally announced June 2022.

Comments: Accepted by IEEE Transactions on Wireless Communications. arXiv admin note: text overlap with arXiv:2108.11123

arXiv:2205.14283 [pdf, other]

doi 10.1109/MSP.2022.3198201

Rethinking Bayesian Learning for Data Analysis: The Art of Prior and Inference in Sparsity-Aware Modeling

Authors: Lei Cheng, Feng Yin, Sergios Theodoridis, Sotirios Chatzis, Tsung-Hui Chang

Abstract: Sparse modeling for signal processing and machine learning has been at the focus of scientific research for over two decades. Among others, supervised sparsity-aware learning comprises two major paths paved by: a) discriminative methods and b) generative methods. The latter, more widely known as Bayesian methods, enable uncertainty evaluation w.r.t. the performed predictions. Furthermore, they can… ▽ More Sparse modeling for signal processing and machine learning has been at the focus of scientific research for over two decades. Among others, supervised sparsity-aware learning comprises two major paths paved by: a) discriminative methods and b) generative methods. The latter, more widely known as Bayesian methods, enable uncertainty evaluation w.r.t. the performed predictions. Furthermore, they can better exploit related prior information and naturally introduce robustness into the model, due to their unique capacity to marginalize out uncertainties related to the parameter estimates. Moreover, hyper-parameters associated with the adopted priors can be learnt via the training data. To implement sparsity-aware learning, the crucial point lies in the choice of the function regularizer for discriminative methods and the choice of the prior distribution for Bayesian learning. Over the last decade or so, due to the intense research on deep learning, emphasis has been put on discriminative techniques. However, a come back of Bayesian methods is taking place that sheds new light on the design of deep neural networks, which also establish firm links with Bayesian models and inspire new paths for unsupervised learning, such as Bayesian tensor decomposition. The goal of this article is two-fold. First, to review, in a unified way, some recent advances in incorporating sparsity-promoting priors into three highly popular data modeling tools, namely deep neural networks, Gaussian processes, and tensor decomposition. Second, to review their associated inference techniques from different aspects, including: evidence maximization via optimization and variational inference methods. Challenges such as small data dilemma, automatic model structure search, and natural prediction uncertainty evaluation are also discussed. Typical signal processing and machine learning tasks are demonstrated. △ Less

Submitted 27 May, 2022; originally announced May 2022.

Comments: 64 pages, 16 figures, 6 tables, 98 references, submitted to IEEE Signal Processing Magazine

arXiv:2204.00779 [pdf, other]

doi 10.1109/TSP.2024.3352912

Downlink Channel Covariance Matrix Reconstruction for FDD Massive MIMO Systems with Limited Feedback

Authors: Kai Li, Ying Li, Lei Cheng, Qingjiang Shi, Zhi-Quan Luo

Abstract: The downlink channel covariance matrix (CCM) acquisition is the key step for the practical performance of massive multiple-input and multiple-output (MIMO) systems, including beamforming, channel tracking, and user scheduling. However, this task is challenging in the popular frequency division duplex massive MIMO systems with Type I codebook due to the limited channel information feedback. In this… ▽ More The downlink channel covariance matrix (CCM) acquisition is the key step for the practical performance of massive multiple-input and multiple-output (MIMO) systems, including beamforming, channel tracking, and user scheduling. However, this task is challenging in the popular frequency division duplex massive MIMO systems with Type I codebook due to the limited channel information feedback. In this paper, we propose a novel formulation that leverages the structure of the codebook and feedback values for an accurate estimation of the downlink CCM. Then, we design a cutting plane algorithm to consecutively shrink the feasible set containing the downlink CCM, enabled by the careful design of pilot weighting matrices. Theoretical analysis shows that as the number of communication rounds increases, the proposed cutting plane algorithm can recover the ground-truth CCM. Numerical results are presented to demonstrate the superior performance of the proposed algorithm over the existing benchmark in CCM reconstruction. △ Less

Submitted 12 September, 2023; v1 submitted 2 April, 2022; originally announced April 2022.

arXiv:2203.13991 [pdf]

Risk Assessment with Generic Energy Storage under Exogenous and Endogenous Uncertainty

Authors: Ning Qi, Lin Cheng, Yuxiang Wan, Yingrui Zhuang, Zeyu Liu

Abstract: Current risk assessment ignores the stochastic nature of energy storage availability itself and thus lead to potential risk during operation. This paper proposes the redefinition of generic energy storage (GES) that is allowed to offer probabilistic reserve. A data-driven unified model with exogenous and endogenous uncertainty (EXU & EDU) description is presented for four typical types of GES. Mor… ▽ More Current risk assessment ignores the stochastic nature of energy storage availability itself and thus lead to potential risk during operation. This paper proposes the redefinition of generic energy storage (GES) that is allowed to offer probabilistic reserve. A data-driven unified model with exogenous and endogenous uncertainty (EXU & EDU) description is presented for four typical types of GES. Moreover, risk indices are proposed to assess the impact of overlooking (EXU & EDU) of GES. Comparative results between EXU & EDU are illustrated in distribution system with day-ahead chance-constrained optimization (CCO) and more severe risks are observed for the latter, which indicate that system operator (SO) should adopt novel strategies for EDU uncertainty. △ Less

Submitted 26 March, 2022; originally announced March 2022.

Comments: PES GM2022-Exogenous and Endogenous Uncertainty

arXiv:2202.11490 [pdf, other]

Towards Tailored Models on Private AIoT Devices: Federated Direct Neural Architecture Search

Authors: Chunhui Zhang, Xiaoming Yuan, Qianyun Zhang, Guangxu Zhu, Lei Cheng, Ning Zhang

Abstract: Neural networks often encounter various stringent resource constraints while deploying on edge devices. To tackle these problems with less human efforts, automated machine learning becomes popular in finding various neural architectures that fit diverse Artificial Intelligence of Things (AIoT) scenarios. Recently, to prevent the leakage of private information while enable automated machine intelli… ▽ More Neural networks often encounter various stringent resource constraints while deploying on edge devices. To tackle these problems with less human efforts, automated machine learning becomes popular in finding various neural architectures that fit diverse Artificial Intelligence of Things (AIoT) scenarios. Recently, to prevent the leakage of private information while enable automated machine intelligence, there is an emerging trend to integrate federated learning and neural architecture search (NAS). Although promising as it may seem, the coupling of difficulties from both tenets makes the algorithm development quite challenging. In particular, how to efficiently search the optimal neural architecture directly from massive non-independent and identically distributed (non-IID) data among AIoT devices in a federated manner is a hard nut to crack. In this paper, to tackle this challenge, by leveraging the advances in ProxylessNAS, we propose a Federated Direct Neural Architecture Search (FDNAS) framework that allows for hardware-friendly NAS from non- IID data across devices. To further adapt to both various data distributions and different types of devices with heterogeneous embedded hardware platforms, inspired by meta-learning, a Cluster Federated Direct Neural Architecture Search (CFDNAS) framework is proposed to achieve device-aware NAS, in the sense that each device can learn a tailored deep learning model for its particular data distribution and hardware constraint. Extensive experiments on non-IID datasets have shown the state-of-the-art accuracy-efficiency trade-offs achieved by the proposed solution in the presence of both data and device heterogeneity. △ Less

Submitted 23 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2011.03372

arXiv:2202.01630 [pdf, other]

A deep complex multi-frame filtering network for stereophonic acoustic echo cancellation

Authors: Linjuan Cheng, Chengshi Zheng, Andong Li, Yuquan Wu, Renhua Peng, Xiaodong Li

Abstract: In hands-free communication system, the coupling between loudspeaker and microphone generates echo signal, which can severely influence the quality of communication. Meanwhile, various types of noise in communication environments further reduce speech quality and intelligibility. It is difficult to extract the near-end signal from the microphone signal within one step, especially in low signal-to-… ▽ More In hands-free communication system, the coupling between loudspeaker and microphone generates echo signal, which can severely influence the quality of communication. Meanwhile, various types of noise in communication environments further reduce speech quality and intelligibility. It is difficult to extract the near-end signal from the microphone signal within one step, especially in low signal-to-noise ratio scenarios. In this paper, we propose a deep complex network approach to address this issue. Specially, we decompose the stereophonic acoustic echo cancellation into two stages, including linear stereophonic acoustic echo cancellation module and residual echo suppression module, where both modules are based on deep learning architectures. A multi-frame filtering strategy is introduced to benefit the estimation of linear echo by capturing more inter-frame information. Moreover, we decouple the complex spectral map** into magnitude estimation and complex spectrum refinement. Experimental results demonstrate that our proposed approach achieves stage-of-the-art performance over previous advanced algorithms under various conditions. △ Less

Submitted 5 May, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

arXiv:2201.13143 [pdf, other]

CoTV: Cooperative Control for Traffic Light Signals and Connected Autonomous Vehicles using Deep Reinforcement Learning

Authors: Jiaying Guo, Long Cheng, Shen Wang

Abstract: The target of reducing travel time only is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), a further reduction of fuel and emissions, improvements of traffic safety, and the ease of infrastructure deployment and maintenance should also be considered. Different from existing work focusing on the… ▽ More The target of reducing travel time only is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), a further reduction of fuel and emissions, improvements of traffic safety, and the ease of infrastructure deployment and maintenance should also be considered. Different from existing work focusing on the optimization of the control in either traffic light signal (to improve the intersection throughput), or vehicle speed (to stabilize the traffic), this paper presents a multi-agent Deep Reinforcement Learning (DRL) system called CoTV, which Cooperatively controls both Traffic light signals and Connected Autonomous Vehicles (CAV). Therefore, our CoTV can well balance the achievement of the reduction of travel time, fuel, and emissions. In the meantime, CoTV can also be easy to deploy by cooperating with only one CAV that is the nearest to the traffic light controller on each incoming road. This enables more efficient coordination between traffic light controllers and CAV, thus leading to the convergence of training CoTV under the large-scale multi-agent scenario that is traditionally difficult to converge. We give the detailed system design of CoTV and demonstrate its effectiveness in a simulation study using SUMO under various grid maps and realistic urban scenarios with mixed-autonomy traffic. △ Less

Submitted 14 February, 2023; v1 submitted 31 January, 2022; originally announced January 2022.

arXiv:2201.11999 [pdf, other]

doi 10.1145/3474085.3475180

Dual Learning Music Composition and Dance Choreography

Authors: Shuang Wu, Zhenguang Li, Shijian Lu, Li Cheng

Abstract: Music and dance have always co-existed as pillars of human activities, contributing immensely to the cultural, social, and entertainment functions in virtually all societies. Notwithstanding the gradual systematization of music and dance into two independent disciplines, their intimate connection is undeniable and one art-form often appears incomplete without the other. Recent research works have… ▽ More Music and dance have always co-existed as pillars of human activities, contributing immensely to the cultural, social, and entertainment functions in virtually all societies. Notwithstanding the gradual systematization of music and dance into two independent disciplines, their intimate connection is undeniable and one art-form often appears incomplete without the other. Recent research works have studied generative models for dance sequences conditioned on music. The dual task of composing music for given dances, however, has been largely overlooked. In this paper, we propose a novel extension, where we jointly model both tasks in a dual learning approach. To leverage the duality of the two modalities, we introduce an optimal transport objective to align feature embeddings, as well as a cycle consistency loss to foster overall consistency. Experimental results demonstrate that our dual learning framework improves individual task performance, delivering generated music compositions and dance choreographs that are realistic and faithful to the conditioned inputs. △ Less

Submitted 28 January, 2022; originally announced January 2022.

Comments: ACMMM 2021 (Oral)

arXiv:2201.08583 [pdf, other]

doi 10.1121/10.0009280

Tensor-based Basis Function Learning for Three-dimensional Sound Speed Fields

Authors: Lei Cheng, Xingyu Ji, Hangfang Zhao, Jianlong Li, Wen Xu

Abstract: Basis function learning is the step** stone towards effective three-dimensional (3D) sound speed field (SSF) inversion for various acoustic signal processing tasks, including ocean acoustic tomography, underwater target localization/tracking, and underwater communications. Classical basis functions include the empirical orthogonal functions (EOFs), Fourier basis functions, and their combinations… ▽ More Basis function learning is the step** stone towards effective three-dimensional (3D) sound speed field (SSF) inversion for various acoustic signal processing tasks, including ocean acoustic tomography, underwater target localization/tracking, and underwater communications. Classical basis functions include the empirical orthogonal functions (EOFs), Fourier basis functions, and their combinations. The unsupervised machine learning method, e.g., the K-SVD algorithm, has recently tapped into the basis function design, showing better representation performance than the EOFs. However, existing methods do not consider basis function learning approaches that treat 3D SSF data as a third-order tensor, and thus cannot fully utilize the 3D interactions/correlations therein. To circumvent such a drawback, basis function learning is linked to tensor decomposition in this paper, which is the primary drive for recent multi-dimensional data mining. In particular, a tensor-based basis function learning framework is proposed, which can include the classical basis functions (using EOFs and/or Fourier basis functions) as its special cases. This provides a unified tensor perspective for understanding and representing 3D SSFs. Numerical results using the South China Sea 3D SSF data have demonstrated the excellent performance of the tensor-based basis functions. △ Less

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2112.01806 [pdf, other]

Music-to-Dance Generation with Optimal Transport

Authors: Shuang Wu, Shijian Lu, Li Cheng

Abstract: Dance choreography for a piece of music is a challenging task, having to be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling and generative adversarial networks, but their generated dance sequences are often short of motion reali… ▽ More Dance choreography for a piece of music is a challenging task, having to be creative in presenting distinctive stylistic dance elements while taking into account the musical theme and rhythm. It has been tackled by different approaches such as similarity retrieval, sequence-to-sequence modeling and generative adversarial networks, but their generated dance sequences are often short of motion realism, diversity and music consistency. In this paper, we propose a Music-to-Dance with Optimal Transport Network (MDOT-Net) for learning to generate 3D dance choreographies from music. We introduce an optimal transport distance for evaluating the authenticity of the generated dance distribution and a Gromov-Wasserstein distance to measure the correspondence between the dance distribution and the input music. This gives a well defined and non-divergent training objective that mitigates the limitation of standard GAN training which is frequently plagued with instability and divergent generator loss issues. Extensive experiments demonstrate that our MDOT-Net can synthesize realistic and diverse dances which achieve an organic unity with the input music, reflecting the shared intentionality and matching the rhythmic articulation. Sample results are found at https://www.youtube.com/watch?v=dErfBkrlUO8. △ Less

Submitted 4 May, 2022; v1 submitted 3 December, 2021; originally announced December 2021.

Comments: IJCAI 2022

arXiv:2111.11658 [pdf, other]

The RETA Benchmark for Retinal Vascular Tree Analysis

Authors: Xingzheng Lyu, Li Cheng, Sanyuan Zhang

Abstract: Topological and geometrical analysis of retinal blood vessel is a cost-effective way for early detection of many common diseases. Meanwhile, automated vessel segmentation and vascular tree analysis are still lacking in terms of generalization capability. In this work, we construct a novel benchmark RETA with 81 labeled vessel masks aiming to facilitate retinal vessel analysis. A semi-automated coa… ▽ More Topological and geometrical analysis of retinal blood vessel is a cost-effective way for early detection of many common diseases. Meanwhile, automated vessel segmentation and vascular tree analysis are still lacking in terms of generalization capability. In this work, we construct a novel benchmark RETA with 81 labeled vessel masks aiming to facilitate retinal vessel analysis. A semi-automated coarse-to-fine workflow is proposed to annotating vessel pixels. During dataset construction, we strived to control inter-annotator variability and intra-annotator variability by performing multi-stage annotation and label disambiguation on self-developed dedicated software. In addition to binary vessel masks, we obtained vessel annotations containing artery/vein masks, vascular skeletons, bifurcations, trees and abnormalities during vessel labelling. Both subjective and objective quality validation of labeled vessel masks have demonstrated significant improved quality over other publicly datasets. The annotation software is also made publicly available for vessel annotation visualization. Users could develop vessel segmentation algorithms or evaluate vessel segmentation performance with our dataset. Moreover, our dataset might be a good research source for cross-modality tubular structure segmentation. △ Less

Submitted 23 November, 2021; originally announced November 2021.

Comments: 13 pages,6 figures, 4 tables

arXiv:2110.12915 [pdf]

Revealing unforeseen diagnostic image features with deep learning by detecting cardiovascular diseases from apical four-chamber ultrasounds

Authors: Li-Hsin Cheng, Pablo B. J. Bosch, Rutger F. H. Hofman, Timo B. Brakenhoff, Eline F. Bruggemans, Rob J. van der Geest, Eduard R. Holman

Abstract: Background. With the rise of highly portable, wireless, and low-cost ultrasound devices and automatic ultrasound acquisition techniques, an automated interpretation method requiring only a limited set of views as input could make preliminary cardiovascular disease diagnoses more accessible. In this study, we developed a deep learning (DL) method for automated detection of impaired left ventricular… ▽ More Background. With the rise of highly portable, wireless, and low-cost ultrasound devices and automatic ultrasound acquisition techniques, an automated interpretation method requiring only a limited set of views as input could make preliminary cardiovascular disease diagnoses more accessible. In this study, we developed a deep learning (DL) method for automated detection of impaired left ventricular (LV) function and aortic valve (AV) regurgitation from apical four-chamber (A4C) ultrasound cineloops and investigated which anatomical structures or temporal frames provided the most relevant information for the DL model to enable disease classification. Methods and Results. A4C ultrasounds were extracted from 3,554 echocardiograms of patients with either impaired LV function (n=928), AV regurgitation (n=738), or no significant abnormalities (n=1,888). Two convolutional neural networks (CNNs) were trained separately to classify the respective disease cases against normal cases. The overall classification accuracy of the impaired LV function detection model was 86%, and that of the AV regurgitation detection model was 83%. Feature importance analyses demonstrated that the LV myocardium and mitral valve were important for detecting impaired LV function, while the tip of the mitral valve anterior leaflet, during opening, was considered important for detecting AV regurgitation. Conclusion. The proposed method demonstrated the feasibility of a 3D CNN approach in detection of impaired LV function and AV regurgitation using A4C ultrasound cineloops. The current research shows that DL methods can exploit large training data to detect diseases in a different way than conventionally agreed upon methods, and potentially reveal unforeseen diagnostic image features. △ Less

Submitted 25 October, 2021; originally announced October 2021.

arXiv:2110.07191 [pdf]

doi 10.1177/14759217211050012

CNN-DST: ensemble deep learning based on Dempster-Shafer theory for vibration-based fault recognition

Authors: Vahid Yaghoubi, Liangliang Cheng, Wim Van Paepegem, Mathias Kersemans

Abstract: Nowadays, using vibration data in conjunction with pattern recognition methods is one of the most common fault detection strategies for structures. However, their performances depend on the features extracted from vibration data, the features selected to train the classifier, and the classifier used for pattern recognition. Deep learning facilitates the fault detection procedure by automating the… ▽ More Nowadays, using vibration data in conjunction with pattern recognition methods is one of the most common fault detection strategies for structures. However, their performances depend on the features extracted from vibration data, the features selected to train the classifier, and the classifier used for pattern recognition. Deep learning facilitates the fault detection procedure by automating the feature extraction and selection, and classification procedure. Though, deep learning approaches have challenges in designing its structure and tuning its hyperparameters, which may result in a low generalization capability. Therefore, this study proposes an ensemble deep learning framework based on a convolutional neural network (CNN) and Dempster-Shafer theory (DST), called CNN-DST. In this framework, several CNNs with the proposed structure are first trained, and then, the outputs of the CNNs selected by the proposed technique are combined by using an improved DST-based method. To validate the proposed CNN-DST framework, it is applied to an experimental dataset created by the broadband vibrational responses of polycrystalline Nickel alloy first-stage turbine blades with different types and severities of damage. Through statistical analysis, it is shown that the proposed CNN-DST framework classifies the turbine blades with an average prediction accuracy of 97.19%. The proposed CNN-DST framework is benchmarked with other state-of-the-art classification methods, demonstrating its high performance. The robustness of the proposed CNN-DST framework with respect to measurement noise is investigated, showing its high noise-resistance. Further, bandwidth analysis reveals that most of the required information for detecting faulty samples is available in a small frequency range. △ Less

Submitted 14 October, 2021; originally announced October 2021.

Journal ref: Structural Health Monitoring, February 2022

arXiv:2110.06909 [pdf, other]

Reinforcement Learning for Standards Design

Authors: Shahrukh Khan Kasi, Sayandev Mukherjee, Lin Cheng, Bernardo A. Huberman

Abstract: Communications standards are designed via committees of humans holding repeated meetings over months or even years until consensus is achieved. This includes decisions regarding the modulation and coding schemes to be supported over an air interface. We propose a way to "automate" the selection of the set of modulation and coding schemes to be supported over a given air interface and thereby strea… ▽ More Communications standards are designed via committees of humans holding repeated meetings over months or even years until consensus is achieved. This includes decisions regarding the modulation and coding schemes to be supported over an air interface. We propose a way to "automate" the selection of the set of modulation and coding schemes to be supported over a given air interface and thereby streamline both the standards design process and the ease of extending the standard to support new modulation schemes applicable to new higher-level applications and services. Our scheme involves machine learning, whereby a constructor entity submits proposals to an evaluator entity, which returns a score for the proposal. The constructor employs reinforcement learning to iterate on its submitted proposals until a score is achieved that was previously agreed upon by both constructor and evaluator to be indicative of satisfying the required design criteria (including performance metrics for transmissions over the interface). △ Less

Submitted 13 October, 2021; originally announced October 2021.

arXiv:2108.11123 [pdf, ps, other]

A Bayesian Tensor Approach to Enable RIS for 6G Massive Unsourced Random Access

Authors: Xiaodan Shao, Lei Cheng, Xiaoming Chen, Chongwen Huang, Derrick Wing Kwan Ng

Abstract: This paper investigates the problem of joint massive devices separation and channel estimation for a reconfigurable intelligent surface (RIS)-aided unsourced random access (URA) scheme in the sixth-generation (6G) wireless networks. In particular, by associating the data sequences to a rank-one tensor and exploiting the angular sparsity of the channel, the detection problem is cast as a high-order… ▽ More This paper investigates the problem of joint massive devices separation and channel estimation for a reconfigurable intelligent surface (RIS)-aided unsourced random access (URA) scheme in the sixth-generation (6G) wireless networks. In particular, by associating the data sequences to a rank-one tensor and exploiting the angular sparsity of the channel, the detection problem is cast as a high-order coupled tensor decomposition problem. However, the coupling among multiple devices to RIS (device-RIS) channels together with their sparse structure make the problem intractable. By devising novel priors to incorporate problem structures, we design a novel probabilistic model to capture both the element-wise sparsity from the angular channel model and the low rank property due to the sporadic nature of URA. Based on the this probabilistic model, we develop a coupled tensor-based automatic detection (CTAD) algorithm under the framework of variational inference with fast convergence and low computational complexity. Moreover, the proposed algorithm can automatically learn the number of active devices and thus effectively avoid noise overfitting. Extensive simulation results confirm the effectiveness and improvements of the proposed URA algorithm in large-scale RIS regime. △ Less

Submitted 25 August, 2021; originally announced August 2021.

Comments: IEEE GLOBECOM 2021

arXiv:2108.03690 [pdf, other]

Enhanced Invertible Encoding for Learned Image Compression

Authors: Yueqi Xie, Ka Leong Cheng, Qifeng Chen

Abstract: Although deep learning based image compression methods have achieved promising progress these days, the performance of these methods still cannot match the latest compression standard Versatile Video Coding (VVC). Most of the recent developments focus on designing a more accurate and flexible entropy model that can better parameterize the distributions of the latent features. However, few efforts… ▽ More Although deep learning based image compression methods have achieved promising progress these days, the performance of these methods still cannot match the latest compression standard Versatile Video Coding (VVC). Most of the recent developments focus on designing a more accurate and flexible entropy model that can better parameterize the distributions of the latent features. However, few efforts are devoted to structuring a better transformation between the image space and the latent feature space. In this paper, instead of employing previous autoencoder style networks to build this transformation, we propose an enhanced Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate the information loss problem for better compression. Experimental results on the Kodak, CLIC, and Tecnick datasets show that our method outperforms the existing learned image compression methods and compression standards, including VVC (VTM 12.1), especially for high-resolution images. Our source code is available at https://github.com/xyq7/InvCompress. △ Less

Submitted 8 August, 2021; originally announced August 2021.

Comments: Accepted to ACM Multimedia 2021 as Oral

arXiv:2106.15283 [pdf, other]

doi 10.1145/3448021

Similarity Embedding Networks for Robust Human Activity Recognition

Authors: Chenglin Li, Carrie Lu Tong, Di Niu, Bei Jiang, Xiao Zuo, Lei Cheng, Jian Xiong, Jianming Yang

Abstract: Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vec… ▽ More Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers. The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space, and can be effectively trained on a small dataset and even on a noisy dataset with mislabeled samples. Based on the learned embeddings, we further propose both nonparametric and parametric approaches for activity recognition. Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks, is robust to mislabeled samples in the training set, and can also be used to effectively denoise a noisy dataset. △ Less

Submitted 31 May, 2021; originally announced June 2021.

arXiv:2104.03603 [pdf, other]

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Authors: Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, **gdong Chen

Abstract: In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical ap… ▽ More In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given most open source dataset for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in speech community. We also release a PyTorch-based training and evaluation framework as baseline system to promote reproducible research in this field. △ Less

Submitted 10 August, 2021; v1 submitted 8 April, 2021; originally announced April 2021.

Comments: Accepted by Interspeech 2021

arXiv:2102.05610 [pdf, other]

Searching for Fast Model Families on Datacenter Accelerators

Authors: Sheng Li, Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le, Norman P. Jouppi

Abstract: Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model famil… ▽ More Neural Architecture Search (NAS), together with model scaling, has shown remarkable progress in designing high accuracy and fast convolutional architecture families. However, as neither NAS nor model scaling considers sufficient hardware architecture details, they do not take full advantage of the emerging datacenter (DC) accelerators. In this paper, we search for fast and accurate CNN model families for efficient inference on DC accelerators. We first analyze DC accelerators and find that existing CNNs suffer from insufficient operational intensity, parallelism, and execution efficiency. These insights let us create a DC-accelerator-optimized search space, with space-to-depth, space-to-batch, hybrid fused convolution structures with vanilla and depthwise convolutions, and block-wise activation functions. On top of our DC accelerator optimized neural architecture search space, we further propose a latency-aware compound scaling (LACS), the first multi-objective compound scaling method optimizing both accuracy and latency. Our LACS discovers that network depth should grow much faster than image size and network width, which is quite different from previous compound scaling results. With the new search space and LACS, our search and scaling on datacenter accelerators results in a new model series named EfficientNet-X. EfficientNet-X is up to more than 2X faster than EfficientNet (a model series with state-of-the-art trade-off on FLOPs and accuracy) on TPUv3 and GPUv100, with comparable accuracy. EfficientNet-X is also up to 7X faster than recent RegNet and ResNeSt on TPUv3 and GPUv100. △ Less

Submitted 10 February, 2021; originally announced February 2021.

Showing 1–50 of 82 results for author: Cheng, L