Search | arXiv e-print repository

Coded Kalman Filtering over MIMO Gaussian Channels with Feedback

Authors: Barron Han, Oron Sabag, Victoria Kostina, Babak Hassibi

Abstract: We consider the problem of remotely stabilizing a linear dynamical system. In this setting, a sensor co-located with the system communicates the system's state to a controller over a noisy communication channel with feedback. The objective of the controller (decoder) is to use the channel outputs to estimate the vector state with finite zero-delay mean squared error (MSE) at the infinite horizon. It has been shown in [1] that for a vector Gauss-Markov source and either a single-input multiple-output (SIMO) or a multiple-input single-output (MISO) channel, linear codes require the minimum capacity to achieve finite MSE. This paper considers the more general problem of linear zero-delay joint source-channel coding (JSCC) of a vector-valued source over a multiple-input multiple-output (MIMO) Gaussian channel with feedback. We study sufficient and necessary conditions for linear codes to achieve finite MSE. For sufficiency, we introduce a coding scheme where each unstable source mode is allocated to a single channel for estimation. Our proof for the necessity of this scheme relies on a matrix-algebraic conjecture that we prove to be true if either the source or channel is scalar. We show that linear codes achieve finite MSE for a scalar source over a MIMO channel if and only if the best scalar sub-channel can achieve finite MSE. Finally, we provide a new counter-example demonstrating that linear codes are generally sub-optimal for coding over MIMO channels.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted for presentation at the 2024 IEEE International Symposium on Information Theory

arXiv:2406.11364 [pdf, other]

AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection

Authors: Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, **yi Fan

Abstract: Large pre-trained models have demonstrated dominant performances in multiple areas, where the consistency between pre-training and fine-tuning is the key to success. However, few works reported satisfactory results of pre-trained models for the machine anomalous sound detection (ASD) task. This may be caused by the inconsistency of the pre-trained model and the inductive bias of machine audio, resulting in inconsistency in data and architecture. Thus, we propose AnoPatch which utilizes a ViT backbone pre-trained on AudioSet and fine-tunes it on machine audio. It is believed that machine audio is more related to audio datasets than speech datasets, and modeling it from patch level suits the sparsity of machine audio. As a result, AnoPatch showcases state-of-the-art (SOTA) performances on the DCASE 2020 ASD dataset and the DCASE 2023 ASD dataset. We also compare multiple pre-trained models and empirically demonstrate that better consistency yields considerable improvement.

Submitted 17 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH 2024

arXiv:2406.07855 [pdf, other]

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Authors: Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, **yu Li, Furu Wei

Abstract: With the help of discrete neural audio codecs, large language models (LLMs) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling-based decoding strategies bring astonishing diversity to generation, but also pose robustness issues such as typos, omissions and repetition. In addition, the high sampling rate of audio also brings huge computational overhead to the inference process of autoregression. To address these issues, we propose VALL-E R, a robust and efficient zero-shot TTS system, building upon the foundation of VALL-E. Specifically, we introduce a phoneme monotonic alignment strategy to strengthen the connection between phonemes and acoustic sequence, ensuring a more precise alignment by constraining the acoustic tokens to match their associated phonemes. Furthermore, we employ a codec-merging approach to downsample the discrete codes in shallow quantization layer, thereby accelerating the decoding speed while preserving the high quality of speech output. Benefiting from these strategies, VALL-E R obtains controllability over phonemes and demonstrates its strong robustness by approaching the WER of ground truth. In addition, it requires fewer autoregressive steps, with over 60% time reduction during inference. This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia. Audio samples will be available at: https://aka.ms/valler.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 15 pages, 5 figures

arXiv:2405.09245 [pdf, other]

A Robust UAV-Based Approach for Power-Modulated Jammer Localization Using DoA

Authors: Zexin Fang, Bin Han, Hans D. Schotten

Abstract: Unmanned aerial vehicles (UAVs) are well-suited to localize jammers, particularly when jammers are at non-terrestrial locations, where conventional detection methods face challenges. In this work we propose a novel localization method, sample pruning gradient descend (SPGD), which offers robust performance against multiple power-modulated jammers with low computational complexity.

Submitted 21 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

Comments: Submitted to the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall)

arXiv:2405.04867 [pdf, other]

MIPI 2024 Challenge on Demosaic for HybridEVS Camera: Methods and Results

Authors: Yaqi Wu, Zhihao Fan, Xiaofeng Chu, Jimmy S. Ren, Xiaoming Li, Zongsheng Yue, Chongyi Li, Shangcheng Zhou, Ruicheng Feng, Yuekun Dai, Peiqing Yang, Chen Change Loy, Senyan Xu, Zhi**g Sun, Jiaying Zhu, Yurui Zhu, Xueyang Fu, Zheng-Jun Zha, Jun Cao, Cheng Li, Shu Chen, Liang Ma, Shiyang Zhou, Hai** Zeng, Kai Feng, et al. (24 additional authors not shown)

Abstract: The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Nighttime Flare Removal track on MIPI 2024. In total, 170 participants were successfully registered, and 14 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Nighttime Flare Removal. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2024/.

Submitted 8 May, 2024; originally announced May 2024.

Comments: MIPI@CVPR2024. Website: https://mipi-challenge.org/MIPI2024/

arXiv:2404.16484 [pdf, other]

Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, **shan Pan, Jiangxin Dong, **hui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi **, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF codec, instead of JPEG. All the proposed methods improve PSNR fidelity over Lanczos interpolation, and process images under 10ms. Out of the 160 participants, 25 teams submitted their code and models. The solutions present novel designs tailored for memory-efficiency and runtime on edge devices. This survey describes the best solutions for real-time SR of compressed high-resolution images.

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024, AI for Streaming (AIS) Workshop

arXiv:2404.10343 [pdf, other]

The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.

Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

arXiv:2404.03088 [pdf, other]

Robust Federated Learning for Wireless Networks: A Demonstration with Channel Estimation

Authors: Zexin Fang, Bin Han, Hans D. Schotten

Abstract: Federated learning (FL) offers a privacy-preserving collaborative approach for training models in wireless networks, with channel estimation emerging as a promising application. Despite extensive studies on FL-empowered channel estimation, the security concerns associated with FL require meticulous attention. In a scenario where small base stations (SBSs) serve as local models trained on cached data, and a macro base station (MBS) functions as the global model setting, an attacker can exploit the vulnerability of FL, launching various adversarial attacks or deployment tactics. In this paper, we analyze such vulnerabilities, bring forth corresponding solutions, and validate them through simulation.

Submitted 3 April, 2024; originally announced April 2024.

Comments: Submitted to IEEE GLOBECOM 2024

arXiv:2404.02159 [pdf, other]

Fairness-aware Age-of-Information Minimization in WPT-Assisted Short-Packet THz Communications for mURLLC

Authors: Yao Zhu, Xiaopeng Yuan, Yulin Hu, Bo Ai, Ruikang Wang, Bin Han, Anke Schmeink

Abstract: The technological landscape is swiftly advancing towards large-scale systems, creating significant opportunities, particularly in the domain of Terahertz (THz) communications. Networks designed for massive connectivity, comprising numerous Internet of Things (IoT) devices, are at the forefront of this advancement. In this paper, we consider Wireless Power Transfer (WPT)-enabled networks that support these IoT devices with massive Ultra-Reliable and Low-Latency Communication (mURLLC) services. The focus of such networks is information freshness, with the Age-of-Information (AoI) serving as the pivotal performance metric. In particular, we aim to minimize the maximum AoI among IoT devices by optimizing the scheduling policy. Our analytical findings establish the convexity property of the problem, which can be solved efficiently. Furthermore, we introduce the concept of AoI-oriented cluster capacity, examining the relationship between the number of supported devices and the AoI performance in the network. Numerical simulations validate the advantage of our proposed approach in enhancing AoI performance, indicating its potential to guide the design of future THz communication systems for IoT applications requiring mURLLC services.

Submitted 15 February, 2024; originally announced April 2024.

arXiv:2402.09810 [pdf, other]

3D Cooperative Localization in UAV Systems: CRLB Analysis and Security Solutions

Authors: Zexin Fang, Bin Han, Hans D. Schotten

Abstract: This paper presents a robust and secure framework for achieving accurate and reliable cooperative localization in multiple unmanned aerial vehicle (UAV) systems. The Cramér-Rao lower bound (CRLB) for the three-dimensional (3D) cooperative localization network is derived, with particular attention given to the non-uniform spatial distribution of anchor nodes. Challenges of mobility and security threats are addressed, and corresponding solutions are brought forth and numerically assessed. The proposed solution incorporates two key components: the Mobility Adaptive Gradient Descent (MAGD) and Time-evolving Anomaly Detection (TAD). The MAGD adapts the gradient descent algorithm to handle the configuration changes in cooperative localization systems, ensuring accurate localization in dynamic scenarios. The TAD cooperates with a reputation propagation (RP) scheme to detect and mitigate potential attacks by identifying malicious data, enhancing the security and resilience of the cooperative localization.

Submitted 15 February, 2024; originally announced February 2024.

Comments: Submitted to IEEE Transactions on Wireless Communications

arXiv:2401.11902 [pdf, other]

A Training-Free Defense Framework for Robust Learned Image Compression

Authors: Myungseo Song, **young Choi, Bohyung Han

Abstract: We study the robustness of learned image compression models against adversarial attacks and present a training-free defense technique based on simple image transform functions. Recent learned image compression models are vulnerable to adversarial attacks that result in poor compression rate, low reconstruction quality, or weird artifacts. To address the limitations, we propose a simple but effective two-way compression algorithm with random input transforms, which is conveniently applicable to existing image compression models. Unlike the naïve approaches, our approach preserves the original rate-distortion performance of the models on clean images. Moreover, the proposed algorithm requires no additional training or modification of existing models, making it more practical. We demonstrate the effectiveness of the proposed techniques through extensive experiments under multiple compression models, evaluation metrics, and attack scenarios.

Submitted 22 January, 2024; originally announced January 2024.

Comments: 10 pages and 14 figures

arXiv:2312.15946 [pdf, other]

EnchantDance: Unveiling the Potential of Music-Driven Dance Movement

Authors: Bo Han, Yi Ren, Hao Peng, Teng Zhang, Zeyu Ling, Xiang Yin, Feilin Han

Abstract: The task of music-driven dance generation involves creating coherent dance movements that correspond to the given music. While existing methods can produce physically plausible dances, they often struggle to generalize to out-of-set data. The challenge arises from three aspects: 1) the high diversity of dance movements and significant differences in the distribution of music modalities, which make it difficult to generate music-aligned dance movements; 2) the lack of a large-scale music-dance dataset, which hinders the generation of generalized dance movements from music; 3) the protracted nature of dance movements, which poses a challenge to the maintenance of a consistent dance style. In this work, we introduce the EnchantDance framework, a state-of-the-art method for dance generation. Due to the redundancy of the original dance sequence along the time axis, EnchantDance first constructs a strong dance latent space and then trains a dance diffusion model on the dance latent space. To address the data gap, we construct a large-scale music-dance dataset, ChoreoSpectrum3D Dataset, which includes four dance genres and has a total duration of 70.32 hours, making it the largest reported music-dance dataset to date. To enhance consistency between music genre and dance style, we pre-train a music genre prediction network using transfer learning and incorporate music genre as extra conditional information in the training of the dance diffusion model. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on dance quality, diversity, and consistency.

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.09576 [pdf, other]

SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

Authors: Xiangde Luo, Jia Fu, Yunxin Zhong, Shuolin Liu, Bing Han, Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu, Yiwen Ye, Ziyang Chen, Yong Xia, Yanzhou Su, ** Ye, Junjun He, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Kaixiang Yang, Xin Fang, Zhiwei Wang, Chan Woong Lee, Sang Joon Park, Jaehee Chun, Constantin Ulrich, Klaus H. Maier-Hein, et al. (17 additional authors not shown)

Abstract: Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results in many medical image segmentation tasks. However, for NPC OARs and GTVs segmentation, few public datasets are available for model development and evaluation. To alleviate this problem, the SegRap2023 challenge was organized in conjunction with MICCAI2023 and presented a large-scale benchmark for OAR and GTV segmentation with 400 Computed Tomography (CT) scans from 200 NPC patients, each with a pair of pre-aligned non-contrast and contrast-enhanced CT scans. The challenge's goal was to segment 45 OARs and 2 GTVs from the paired CT scans. In this paper, we detail the challenge and analyze the solutions of all participants. The average Dice similarity coefficient scores for all submissions ranged from 76.68% to 86.70%, and 70.42% to 73.44% for OARs and GTVs, respectively. We conclude that the segmentation of large-size OARs is well-addressed, and more efforts are needed for GTVs and small-size or thin-structure OARs. The benchmark will remain publicly available here: https://segrap2023.grand-challenge.org

Submitted 15 December, 2023; originally announced December 2023.

Comments: A challenge report of SegRap2023 (organized in conjunction with MICCAI2023)

arXiv:2310.11747 [pdf, other]

Coded Kalman Filtering Over Gaussian Channels with Feedback

Authors: Barron Han, Oron Sabag, Victoria Kostina, Babak Hassibi

Abstract: This paper investigates the problem of zero-delay joint source-channel coding of a vector Gauss-Markov source over a multiple-input multiple-output (MIMO) additive white Gaussian noise (AWGN) channel with feedback. In contrast to the classical problem of causal estimation using noisy observations, we examine a system where the source can be encoded before transmission. An encoder, equipped with feedback of past channel outputs, observes the source state and encodes the information in a causal manner as inputs to the channel while adhering to a power constraint. The objective of the code is to estimate the source state with minimum mean square error at the infinite horizon. This work shows a fundamental theorem for two scenarios: for the transmission of an unstable vector Gauss-Markov source over either a multiple-input single-output (MISO) or a single-input multiple-output (SIMO) AWGN channel, finite estimation error is achievable if and only if the sum of logs of the unstable eigenvalues of the state gain matrix is less than the Shannon channel capacity. We prove these results by showing an optimal linear innovations encoder that can be applied to sources and channels of any dimension and analyzing it together with the corresponding Kalman filter decoder.

Submitted 18 October, 2023; originally announced October 2023.

Comments: Presented at 59th Allerton Conference on Communication, Control, and Computing
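
The condition quoted in the abstract above (finite estimation error if and only if the sum of logs of the unstable eigenvalues is less than the channel capacity) can be checked numerically. The following is only a minimal sketch, not the authors' code. It assumes a real-valued MISO or SIMO AWGN channel with gain vector `h`, total transmit power `power`, and white noise of variance `noise_var`, for which the standard capacity expression is 0.5 * log2(1 + power * ||h||^2 / noise_var) bits per channel use. All function and parameter names are hypothetical, and the eigenvalues of the state gain matrix are passed in directly.

```python
import math


def required_rate_bits(eigenvalues):
    """Sum of log2|lambda| over the unstable eigenvalues (|lambda| > 1)."""
    return sum(math.log2(abs(lam)) for lam in eigenvalues if abs(lam) > 1)


def awgn_capacity_bits(h, power, noise_var=1.0):
    """Capacity in bits per channel use of a real MISO or SIMO AWGN channel.

    Assumption (not from the abstract): gain vector h, total transmit
    power `power`, white Gaussian noise with variance `noise_var`.
    """
    gain = sum(x * x for x in h)
    return 0.5 * math.log2(1.0 + power * gain / noise_var)


def finite_mse_achievable(eigenvalues, h, power, noise_var=1.0):
    """True when the eigenvalue condition from the abstract holds."""
    return required_rate_bits(eigenvalues) < awgn_capacity_bits(h, power, noise_var)


if __name__ == "__main__":
    # One unstable mode (2.0) and one stable mode (0.5): 1 bit is required.
    print(finite_mse_achievable([2.0, 0.5], h=[1.0, 1.0], power=10.0))
    print(finite_mse_achievable([2.0, 0.5], h=[1.0, 1.0], power=1.0))
```

Both quantities are expressed in bits so that the comparison is unit-consistent; using natural logarithms on both sides would give the same verdict.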

arXiv:2309.11730 [pdf, other]

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

Authors: Shuai Wang, Qibing Bai, Qi Liu, Jianwei Yu, Zhengyang Chen, Bing Han, Yanmin Qian, Haizhou Li

Abstract: Current speaker recognition systems primarily rely on supervised approaches, constrained by the scale of labeled datasets. To boost the system performance, researchers leverage large pretrained models such as WavLM to transfer learned high-level features to the downstream speaker recognition task. However, this approach introduces extra parameters as the pretrained model remains in the inference stage. Another group of researchers directly apply self-supervised methods such as DINO to speaker embedding learning, yet they have not explored its potential on large-scale in-the-wild datasets. In this paper, we present the effectiveness of DINO training on the large-scale WenetSpeech dataset and its transferability in enhancing the supervised system performance on the CNCeleb dataset. Additionally, we introduce a confidence-based data filtering algorithm to remove unreliable data from the pretraining dataset, leading to better performance with less training data. The associated pretrained models, confidence files, pretraining and finetuning scripts will be made available in the Wespeaker toolkit.

Submitted 26 September, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: submitted to ICASSP 2024

arXiv:2309.06672 [pdf, other]

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Authors: Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

Abstract: Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio compared to real data. We found that using simulated training data that is more consistent with real data can achieve an improvement in consistency. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved a new state-of-the-art diarization error rate (DER) performance on all the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model.

Submitted 12 September, 2023; originally announced September 2023.

Comments: IEEE/ACM Transactions on Audio Speech and Language Processing Under Review

arXiv:2309.04270 [pdf, other]

A Reliable and Resilient Framework for Multi-UAV Mutual Localization

Authors: Zexin Fang, Bin Han, Hans D. Schotten

Abstract: This paper presents a robust and secure framework for achieving accurate and reliable mutual localization in multiple unmanned aerial vehicle (UAV) systems. Challenges of accurate localization and security threats are addressed, and corresponding solutions are brought forth and assessed in our paper with numerical simulations. The proposed solution incorporates two key components: the Mobility Adaptive Gradient Descent (MAGD) and Time-evolving Anomaly Detection (TAD). The MAGD adapts the gradient descent algorithm to handle the configuration changes in the mutual localization system, ensuring accurate localization in dynamic scenarios. The TAD cooperates with a reputation propagation (RP) scheme to detect and mitigate potential attacks by identifying UAVs with malicious data, enhancing the security and resilience of the mutual localization

Submitted 8 September, 2023; originally announced September 2023.

Comments: Accepted by the 2023 IEEE 98th Vehicular Technology Conference (VTC2023-Fall), Hong Kong, 10-13 October 2023

arXiv:2308.14360 [pdf, other]

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

Authors: Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song

Abstract: Music editing primarily entails the modification of instrument tracks or remixing in the whole, which offers a novel reinterpretation of the original piece through a series of operations. These music processing methods hold immense potential across various applications but demand substantial expertise. Prior methodologies, although effective for image and audio modifications, falter when directly applied to music. This is attributed to music's distinctive data nature, where such methods can inadvertently compromise the intrinsic harmony and coherence of music. In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce chord progression matrix as condition information and incorporate it in the semantic space to improve melodic harmony while editing. For accommodating extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We tested InstructME in instrument-editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Demo samples are available at https://musicedit.github.io/

Submitted 12 December, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: Demo samples are available at https://musicedit.github.io/

arXiv:2307.10321 [pdf, other]

Terahertz Communications and Sensing for 6G and Beyond: A Comprehensive Review

Authors: Wei Jiang, Qiuheng Zhou, Jiguang He, Mohammad Asif Habibi, Sergiy Melnyk, Mohammed El Absi, Bin Han, Marco Di Renzo, Hans Dieter Schotten, Fa-Long Luo, Tarek S. El-Bawab, Markku Juntti, Merouane Debbah, Victor C. M. Leung

Abstract: Next-generation cellular technologies, commonly referred to as 6G, are envisioned to support a higher system capacity, better performance, and network sensing capabilities. The THz band is one potential enabler to this end due to the large unused frequency bands and the high spatial resolution enabled by the short signal wavelength and large bandwidth. Different from earlier surveys, this paper presents a comprehensive treatment and technology survey on THz communications and sensing in terms of advantages, applications, propagation characterization, channel modeling, measurement campaigns, antennas, transceiver devices, beamforming, networking, the integration of communications and sensing, and experimental testbeds. Starting from the motivation and use cases, we survey the development and historical perspective of THz communications and sensing with the anticipated 6G requirements. We explore the radio propagation, channel modeling, and measurement for the THz band. We then cover the transceiver requirements, architectures, technological challenges, and state-of-the-art approaches to compensate for the high propagation losses, including appropriate antenna design and beamforming solutions. We overview several related technologies that either are required by or are beneficial for THz systems and networks. The synergistic design of sensing and communications is explored in depth. Practical trials, demonstrations, and experiments are also summarized.
The paper gives a holistic view of the current state of the art and highlights the open research challenges towards 6G and beyond.

Submitted 6 May, 2024; v1 submitted 19 July, 2023; originally announced July 2023.

Comments: 56 pages, 9 figures, 11 tables, IEEE Communications Surveys & Tutorials

arXiv:2307.08205 [pdf, ps, other]

Exploring Binary Classification Loss For Speaker Verification

Authors: Bing Han, Zhengyang Chen, Yanmin Qian

Abstract: The mismatch between closed-set training and open-set testing usually leads to significant performance degradation in the speaker verification task. Among existing loss functions, metric learning-based objectives depend strongly on finding effective pairs, which might hinder further improvements, and popular multi-classification methods are usually observed to degrade when evaluated on unseen speakers. In this work, we introduce the SphereFace2 framework, which uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-classification. Benefiting from this learning paradigm, it can efficiently alleviate the gap between training and evaluation. Experiments conducted on Voxceleb show that SphereFace2 outperforms other existing loss functions, especially on hard trials. Besides, the large margin fine-tuning strategy is proven to be compatible with it for further improvements. Finally, SphereFace2 also shows strong robustness to class-wise noisy labels, which gives it the potential to be applied in the semi-supervised training scenario with inaccurately estimated pseudo labels. Code is available at https://github.com/Hunterhuan/sphereface2_speaker_verification

Submitted 16 July, 2023; originally announced July 2023.

Comments: Accepted by ICASSP 2023

arXiv:2306.15161 [pdf, other]

Wespeaker baselines for VoxSRC2023

Authors: Shuai Wang, Chengdong Liang, Xu Xiang, Bing Han, Zhengyang Chen, Hongji Wang, Wen Ding

Abstract: This report showcases the results achieved using the wespeaker toolkit for the VoxSRC2023 Challenge. Our aim is to provide participants, especially those with limited experience, with clear and straightforward guidelines to develop their initial systems. Via well-structured recipes and strong results, we hope to offer an accessible and good enough starting point for all interested individuals. In this report, we describe the results achieved on the VoxSRC2023 dev set using the pretrained models; you can check the CodaLab evaluation server for the results on the evaluation set.

Submitted 28 June, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

arXiv:2305.12021 [pdf, other]

A Secure and Robust Approach for Distance-Based Mutual Positioning of Unmanned Aerial Vehicles

Authors: Bin Han, Hans D. Schotten

Abstract: Unmanned aerial vehicles (UAVs) are becoming increasingly important in modern civilian and military applications. However, their novel use cases are bottlenecked by conventional satellite and terrestrial localization technologies, calling for complementary solutions. Multi-UAV mutual positioning can be a potential answer, but its accuracy and security are challenged by inaccurate and/or malicious measurements. This paper proposes a novel, robust, and secure approach to address these issues.

Submitted 9 January, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted for presentation at the IEEE WCNC 2024

arXiv:2305.10704 [pdf, other]

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

Authors: Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

Abstract: This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In AED-EEND system, we incorporate the target speaker enrollment information used in target speaker voice activity detection (TS-VAD) to calculate the attractor, which can mitigate the speaker permutation problem and facilitate easier model convergence. In the training process, we propose a teacher-forcing strategy to obtain the enrollment information using the ground-truth label. Furthermore, we propose three heuristic decoding methods to identify the enrollment area for each speaker during the evaluation process. Additionally, we enhance the attractor calculation network LSTM used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. By utilizing such an attention-based attractor decoder, our proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5s of enrollment data.

Submitted 15 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted by InterSpeech 2023

arXiv:2305.08029 [pdf, other]

REMAST: Real-time Emotion-based Music Arrangement with Soft Transition

Authors: Zihao Wang, Le Ma, Chen Zhang, Bo Han, Yunfei Xu, Yikai Wang, Xinyi Chen, HaoRong Hong, Wenbo Liu, Xinda Wu, Kejun Zhang

Abstract: Music as an emotional intervention medium has important applications in scenarios such as music therapy, games, and movies. However, music needs real-time arrangement according to changing emotions, bringing challenges to balance emotion real-time fit and soft emotion transition due to the fine-grained and mutable nature of the target emotion. Existing studies mainly focus on achieving emotion real-time fit, while the issue of smooth transition remains understudied, affecting the overall emotional coherence of the music. In this paper, we propose REMAST to address this trade-off. Specifically, we recognize the last timestep's music emotion and fuse it with the current timestep's input emotion. The fused emotion then guides REMAST to generate the music based on the input melody. To adjust music similarity and emotion real-time fit flexibly, we downsample the original melody and feed it into the generation model. Furthermore, we design four music theory features by domain knowledge to enhance emotion information and employ semi-supervised learning to mitigate the subjective bias introduced by manual dataset annotation. According to the evaluation results, REMAST surpasses the state-of-the-art methods in objective and subjective metrics. These results demonstrate that REMAST achieves real-time fit and smooth transition simultaneously, enhancing the coherence of the generated music.

Submitted 5 February, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

ACM Class: H.5.5; F.2.2

arXiv:2304.05754 [pdf, other]

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Authors: Bing Han, Zhengyang Chen, Yanmin Qian

Abstract: Automatic speaker verification task has made great achievements using deep learning approaches with the large-scale manually annotated dataset. However, it's very difficult and expensive to collect a large amount of well-labeled data for system building. In this paper, we propose a novel and advanced self-supervised learning framework which can construct a high performance speaker verification system without using any labeled data. To avoid the impact of false negative pairs, we adopt the self-distillation with no labels (DINO) framework as the initial model, which can be trained without exploiting negative pairs. Then, we introduce a cluster-aware training strategy for DINO to improve the diversity of data. In the iteration learning stage, due to a mass of unreliable labels from clustering, the quality of pseudo labels is important for the system training. This motivates us to propose dynamic loss-gate and label correction (DLG-LC) methods to alleviate the performance degradation caused by unreliable labels. More specifically, we model the loss distribution with GMM and obtain the loss-gate threshold dynamically to distinguish the reliable and unreliable labels. Besides, we adopt the model predictions to correct the unreliable label, for better utilizing the unreliable data rather than dropping them directly. Moreover, we extend the DLG-LC to multi-modality to further improve the performance. The experiments are performed on the commonly used Voxceleb dataset.
Compared to the best-known self-supervised speaker verification system, our proposed method obtains 22.17%, 27.94% and 25.56% relative EER improvement on Vox-O, Vox-E and Vox-H test sets, even with fewer iterations, smaller models, and simpler clustering methods. More importantly, the newly proposed system even achieves comparable results with the fully supervised system, but without using any human labeled data.

Submitted 12 April, 2023; originally announced April 2023.

Comments: Submitted to TASLP on July 19, 2022

arXiv:2301.09080 [pdf, other]

Dance2MIDI: Dance-driven multi-instruments music generation

Authors: Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han

Abstract: Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instruments scenario is under-explored. The challenges associated with the dance-driven multi-instrument music (MIDI) generation are twofold: 1) no publicly available multi-instruments MIDI and video paired dataset and 2) the weak correlation between music and video. To tackle these challenges, we build the first multi-instruments MIDI and dance paired dataset (D2MIDI). Based on our proposed dataset, we introduce a multi-instruments MIDI generation framework (Dance2MIDI) conditioned on dance video. Specifically, 1) to capture the relationship between dance and music, we employ the Graph Convolutional Network to encode the dance motion. This allows us to extract features related to dance movement and dance style, 2) to generate a harmonious rhythm, we utilize a Transformer model to decode the drum track sequence, leveraging a cross-attention mechanism, and 3) we model the task of generating the remaining tracks based on the drum track as a sequence understanding and completion task. A BERT-like model is employed to comprehend the context of the entire music piece through self-supervised learning. We evaluate the generated music of our framework trained on the D2MIDI dataset and demonstrate that our method achieves State-of-the-Art performance.

Submitted 27 February, 2024; v1 submitted 22 January, 2023; originally announced January 2023.

Comments: Accepted by Computational Visual Media Journal

arXiv:2211.00815 [pdf, other]

Build a SRE Challenge System: Lessons from VoxSRC 2022 and CNSRC 2022

Authors: Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian

Abstract: Many speaker recognition challenges have been held to assess the speaker verification system in the wild and probe the performance limit. Voxceleb Speaker Recognition Challenge (VoxSRC), based on the voxceleb, is the most popular. Besides, another challenge called CN-Celeb Speaker Recognition Challenge (CNSRC) is also held this year, which is based on the Chinese celebrity multi-genre dataset CN-Celeb. This year, our team participated in both speaker verification closed tracks in CNSRC 2022 and VoxSRC 2022, and achieved the 1st place and 3rd place respectively. In most system reports, the authors usually only provide a description of their systems but lack an effective analysis of their methods. In this paper, we will outline how to build a strong speaker verification challenge system and give a detailed analysis of each method compared with some other popular technical means.

Submitted 1 June, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Comments: Accepted by InterSpeech 2023

arXiv:2210.15936 [pdf, other]

A comprehensive study on self-supervised distillation for speaker representation learning

Authors: Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, Michael Zeng

Abstract: In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning with no labels has become a more and more promising way to solve it. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function and thus are more attractive. In this paper, we present a comprehensive study on self-distilled self-supervised speaker representation learning, especially on critical data augmentation. Our proposed strategy of audio perturbation augmentation has pushed the performance of the speaker representation to a new limit. The experimental results show that our model can achieve a new SoTA on Voxceleb1 speaker verification evaluation benchmark (i.e., equal error rate (EER) 2.505%, 2.473%, and 4.791% for trial Vox1-O, Vox1-E and Vox1-H, respectively), discarding any speaker labels in the training phase.

Submitted 25 November, 2022; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: Accepted by SLT2022

arXiv:2210.14321 [pdf, other]

Artificial ASMR: A Cyber-Psychological Approach

Authors: Zexin Fang, Bin Han, C. Clark Cao, Hans. D. Schotten

Abstract: The popularity of Autonomous Sensory Meridian Response (ASMR) has skyrocketed over the past decade, but scientific studies on what exactly triggers the ASMR effect remain few and immature. One of the most commonly acknowledged triggers is that ASMR clips typically provide rich semantic information. With our attention caught by the common acoustic patterns in ASMR audios, we investigate the correlation between the cyclic features of audio signals and their effectiveness in triggering ASMR effects. A cyber-psychological approach that combines signal processing, artificial intelligence, and experimental psychology is taken, with which we are able to quantize ASMR-related acoustic features, and therewith synthesize ASMR clips with random cyclic patterns that do not deliver identifiable scenarios to the audience, which were proven to be effective in triggering ASMR effects.

Submitted 5 July, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: Accepted by IEEE MLSP 2023

arXiv:2210.12361 [pdf]

doi 10.2147/JMDH.S417068

MS-DCANet: A Novel Segmentation Network For Multi-Modality COVID-19 Medical Images

Authors: Xiaoyu Pan, Huazheng Zhu, Jinglong Du, Guangtao Hu, Baoru Han, Yuanyuan Jia

Abstract: The Coronavirus Disease 2019 (COVID-19) pandemic has increased the public health burden and brought profound disaster to humans. For the particularity of the COVID-19 medical images with blurred boundaries, low contrast and different infection sites, some researchers have improved the accuracy by adding more complexity. Also, they overlook the complexity of lesions, which hinders their ability to capture the relationship between segmentation sites and the background, as well as the edge contours and global context. However, increasing the computational complexity, parameters and inference speed is unfavorable for model transfer from laboratory to clinic. A perfect segmentation network needs to balance the above three factors completely. To solve the above issues, this paper proposes a symmetric automatic segmentation framework named MS-DCANet. We introduce the Tokenized MLP block, a novel attention scheme that uses a shift-window mechanism to conditionally fuse local and global features to get more continuous boundaries and spatial positioning capabilities. It has a greater understanding of irregular lesion contours. MS-DCANet also uses several Dual Channel blocks and a Res-ASPP block to improve the ability to recognize small targets. On multi-modality COVID-19 tasks, MS-DCANet achieved state-of-the-art performance compared with other baselines. It can well trade off the accuracy and complexity. To prove the strong generalization ability of our proposed model, we apply it to other tasks (ISIC 2018 and BAA) and achieve satisfactory results.

Submitted 19 July, 2023; v1 submitted 22 October, 2022; originally announced October 2022.

Comments: 21 pages, 13 figures, 9 tables

Journal ref: J Multidiscip Healthc. 2023;16:2023-2043

arXiv:2209.09076 [pdf, other]

SJTU-AISPEECH System for VoxCeleb Speaker Recognition Challenge 2022

Authors: Zhengyang Chen, Bing Han, Xu Xiang, Houjun Huang, Bei Liu, Yanmin Qian

Abstract: This report describes the SJTU-AISPEECH system for the Voxceleb Speaker Recognition Challenge 2022. For track1, we implemented two kinds of systems, the online system and the offline system. Different ResNet-based backbones and loss functions are explored. Our final fusion system achieved 3rd place in track1. For track3, we implemented statistic adaptation and jointly training based domain adaptation. In the jointly training based domain adaptation, we jointly trained the source and target domain dataset with different training objectives to do the domain adaptation. We explored two different training objectives for target domain data, self-supervised learning based angular proto-typical loss and semi-supervised learning based classification loss with estimated pseudo labels. Besides, we used the dynamic loss-gate and label correction (DLG-LC) strategy to improve the quality of pseudo labels when the target domain objective is a classification loss. Our final fusion system achieved 4th place (very close to 3rd place, relatively less than 1%) in track3.

Submitted 20 September, 2022; v1 submitted 19 September, 2022; originally announced September 2022.

Comments: System description of VoxSRC 2022

arXiv:2208.01933 [pdf, other]

The SJTU System for Short-duration Speaker Verification Challenge 2021

Authors: Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian

Abstract: This paper presents the SJTU system for both text-dependent and text-independent tasks in short-duration speaker verification (SdSV) challenge 2021. In this challenge, we explored different strong embedding extractors to extract robust speaker embedding. For text-independent task, language-dependent adaptive snorm is explored to improve the system performance under the cross-lingual verification condition. For text-dependent task, we mainly focus on the in-domain fine-tuning strategies based on the model pre-trained on large-scale out-of-domain data. In order to improve the distinction between different speakers uttering the same phrase, we proposed several novel phrase-aware fine-tuning strategies and phrase-aware neural PLDA. With such strategies, the system performance is further improved. Finally, we fused the scores of different systems, and our fusion systems achieved 0.0473 in Task1 (rank 3) and 0.0581 in Task2 (rank 8) on the primary evaluation metric.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Published by Interspeech 2021

arXiv:2208.01928 [pdf, other]

Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction

Authors: Bing Han, Zhengyang Chen, Yanmin Qian

Abstract: For self-supervised speaker verification, the quality of pseudo labels decides the upper bound of the system due to the massive unreliable labels. In this work, we propose dynamic loss-gate and label correction (DLG-LC) to alleviate the performance degradation caused by unreliable estimated labels. In DLG, we adopt Gaussian Mixture Model (GMM) to dynamically model the loss distribution and use the estimated GMM to distinguish the reliable and unreliable labels automatically. Besides, to better utilize the unreliable data instead of dropping them directly, we correct the unreliable label with model predictions. Moreover, we apply the negative-pairs-free DINO framework in our experiments for further improvement. Compared to the best-known speaker verification system with self-supervised learning, our proposed DLG-LC converges faster and achieves 11.45%, 18.35% and 15.16% relative improvement on Vox-O, Vox-E and Vox-H trials of Voxceleb1 evaluation dataset.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted by Interspeech 2022

arXiv:2206.11699 [pdf, ps, other]

The SJTU X-LANCE Lab System for CNSRC 2022

Authors: Zhengyang Chen, Bei Liu, Bing Han, Leying Zhang, Yanmin Qian

Abstract: This technical report describes the SJTU X-LANCE Lab system for the three tracks in CNSRC 2022. In this challenge, we explored the speaker embedding modeling ability of deep ResNet (Deeper r-vector). All the systems are only trained on the Cnceleb training set and we use the same systems for the three tracks in CNSRC 2022. In this challenge, our system ranks the first place in the fixed track of speaker verification task. Our best single system and fusion system achieve 0.3164 and 0.2975 minDCF respectively. Besides, we submit the result of ResNet221 to the speaker retrieval track and achieve 0.4626 mAP. More importantly, we have helped the wespeaker [1] toolkit reproduce our result: https://github.com/wenet-e2e/wespeaker.

Submitted 14 May, 2023; v1 submitted 23 June, 2022; originally announced June 2022.

arXiv:2206.11522 [pdf]

doi 10.3850/978-981-18-5184-1_MS-23-199-cd

Modeling the System-Level Reliability towards a Convergence of Communication, Computing and Control

Authors: Bin Han, Hans D. Schotten

Abstract: Enabled and driven by modern advances in wireless telecommunication and artificial intelligence, the convergence of communication, computing, and control is becoming inevitable in future industrial applications. Analytical and optimizing frameworks, however, are not yet readily developed for this new technical trend. In this work we discuss the necessity and typical scenarios of this convergence, and propose a new approach to model the system-level reliability across all involved domains.

Submitted 23 June, 2022; originally announced June 2022.

Comments: Accepted to appear in the 8th International Symposium on Reliability Engineering and Risk Management (ISRERM 2022)

Journal ref: in Proceedings of the 8th International Symposium on Reliability Engineering and Risk Management, 2022

arXiv:2204.10197 [pdf, other]

Flexible and dependable manufacturing beyond xURLLC: A novel framework for communication-control co-design

Authors: Bin Han, Mu-Xia Sun, Lai-Kan Muk, Yan-Fu Li, Hans D. Schotten

Abstract: Future Industrial 4.0 applications in the 6G era are calling for high dependability that goes far beyond the current ultra-reliable low latency communication (URLLC), and therewith proposed critical challenges to the communication technology. Instead of struggling against the physical and technical limits towards an extreme URLLC (xURLLC), communication-control co-design (CoCoCo) appears to be a more promising solution. This work proposes a novel framework of CoCoCo, which is not only enhancing the dependability of 6G industrial applications such as remote control, but also exhibiting rich potential in revolutionizing the future industry per openness and flexibility of manufacturing systems.

Submitted 5 December, 2022; v1 submitted 18 March, 2022; originally announced April 2022.

Comments: To appear in the 22nd IEEE International Conference on Software Quality, Reliability, and Security (QRS 2022) Workshops

arXiv:2203.04398 [pdf]

Window Filtering Algorithm for Pulsed Light Coherent Combining of Low Repetition Frequency

Authors: Jiali Zhang, Jie Cao, Qun Hao, Yang Cheng, Liquan Dong, Bin Han, Xuesheng Liu

Abstract: The multi-dithering method has been well verified in phase locking of polarization coherent combination experiment. However, it is hard to apply to low repetition frequency pulsed lasers, since there exists an overlap frequency domain between pulse laser and the amplitude phase noise and traditional filters cannot effectively separate phase noise. Aiming to solve the problem in this paper, we propose a novel method of pulse noise detection, identification, and filtering based on the autocorrelation characteristics between noise signals. In the proposed algorithm, a self-designed window algorithm is used to identify the pulse, and then the pulse signal group in the window is replaced by interpolation, which effectively filters the pulse signal doped in the phase noise within 0.1 ms. After filtering the pulses in the phase noise, the phase difference of two pulsed beams (10 kHz) is successfully compensated to zero in 1 ms, and the coherent combination of closed-loop phase lock is realized. At the same time, the phase correction times are few, the phase lock effect is stable, and the final light intensity increases to the ideal value (0.9 Imax).

Submitted 3 March, 2022; originally announced March 2022.

arXiv:2201.02876 [pdf, other]

Defocus Deblur Microscopy via Head-to-Tail Cross-scale Fusion

Authors: Jiahe Wang, Boran Han

Abstract: Microscopy imaging is vital in biology research and diagnosis. When imaging at the scale of cell or molecule level, mechanical drift on the axial axis can be difficult to correct. Although multi-scale networks have been developed for deblurring, those cascade residual learning approaches fail to accurately capture the end-to-end non-linearity of deconvolution, a relation between in-focus images and their out-of-focus counterparts in microscopy. In our model, we adopt a structure of multi-scale U-Net without cascade residual learning. Additionally, in contrast to the conventional coarse-to-fine model, our model strengthens the cross-scale interaction by fusing the features from the coarser sub-networks with the finer ones in a head-to-tail manner: the decoder from the coarser scale is fused with the encoder of the finer ones. Such interaction contributes to better feature learning as fusion happens across decoder and encoder at all scales. Numerous experiments demonstrate that our method yields better performance when compared with other existing models.

Submitted 30 May, 2023; v1 submitted 8 January, 2022; originally announced January 2022.

Comments: Published at ICIP 2022

arXiv:2112.15399 [pdf, other]

InfoNeRF: Ray Entropy Minimization for Few-Shot Neural Volume Rendering

Authors: Mijeong Kim, Seonguk Seo, Bohyung Han

Abstract: We present an information-theoretic regularization technique for few-shot novel view synthesis based on neural implicit representation. The proposed approach minimizes potential reconstruction inconsistency that happens due to insufficient viewpoints by imposing the entropy constraint of the density in each ray. In addition, to alleviate the potential degenerate issue when all training images are acquired from almost redundant viewpoints, we further incorporate the spatial smoothness constraint into the estimated images by restricting information gains from a pair of rays with slightly different viewpoints. The main idea of our algorithm is to make reconstructed scenes compact along individual rays and consistent across rays in the neighborhood. The proposed regularizers can be plugged into most existing neural volume rendering techniques based on NeRF in a straightforward way. Despite its simplicity, we achieve consistently improved performance compared to existing neural view synthesis methods by large margins on multiple standard benchmarks.

Submitted 10 April, 2022; v1 submitted 31 December, 2021; originally announced December 2021.

Comments: CVPR 2022, Website: http://cv.snu.ac.kr/research/InfoNeRF

arXiv:2111.12494 [pdf, other]

doi 10.1109/CSCN53733.2021.9686111

Time-Energy-Constrained Closed-Loop FBL Communication for Dependable MEC

Authors: Bin Han, Yao Zhu, Anke Schmeink, Hans D. Schotten

Abstract: The deployment of multi-access edge computing (MEC) is paving the way towards pervasive intelligence in future 6G networks. This new paradigm also proposes emerging requirements of dependable communications, which go beyond the ultra-reliable low latency communication (URLLC), focusing on the performance of a closed loop instead of that of a unidirectional link. This work studies the simple but efficient one-shot transmission scheme, investigating the closed-loop-reliability-optimal policy of blocklength allocation under stringent time and energy constraints.

Submitted 10 December, 2021; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted for publication at CSCN 2021. V1: accepted version. V2: minor correction in the modulation order. V3: corrections to resolve chaos caused by different normalizations of the FBL PER equation; model figure file updated in HQ

Journal ref: in 2021 IEEE Conference on Standards for Communications and Networking (CSCN), 2021, pp. 180-185

arXiv:2108.09551 [pdf, other]

Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform

Authors: Myungseo Song, Jinyoung Choi, Bohyung Han

Abstract: We propose a versatile deep image compression network based on Spatial Feature Transform (SFT arXiv:1804.02815), which takes a source image and a corresponding quality map as inputs and produces a compressed image with variable rates. Our model covers a wide range of compression rates using a single model, which is controlled by arbitrary pixel-wise quality maps. In addition, the proposed framework allows us to perform task-aware image compressions for various tasks, e.g., classification, by efficiently estimating optimized quality maps specific to target tasks for our encoding network. This is even possible with a pretrained network without learning separate models for individual tasks. Our algorithm achieves outstanding rate-distortion trade-off compared to the approaches based on multiple models that are optimized separately for several different target rates. At the same level of compression, the proposed approach successfully improves performance on image classification and text region quality preservation via task-aware quality map estimation without additional model training. The code is available at the project website: https://github.com/micmic123/QmapCompression

Submitted 21 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2106.08754 [pdf]

Conformal Three-Dimensional Interphase of Li Metal Anode Revealed by Low Dose Cryo-Electron Microscopy

Authors: Bing Han, Xiangyan Li, Shuang Bai, Yucheng Zou, Bingyu Lu, Minghao Zhang, Xiaomin Ma, Zhi Chang, Ying Shirley Meng, Meng Gu

Abstract: Using cryogenic transmission electron microscopy, we revealed three dimensional (3D) structural details of the electrochemically plated lithium (Li) flakes and their solid electrolyte interphase (SEI), including the composite SEI skin-layer and SEI fossil pieces buried inside the Li matrix. As the SEI skin-layer is largely comprised of nanocrystalline LiF and Li2O in amorphous polymeric matrix, when complete Li stripping occurs, the compromised SEI three-dimensional framework buckles, forming nanoscale bends and wrinkles. We showed that the flexibility and resilience of the SEI skin-layer plays a vital role in preserving an intact SEI 3D framework after Li stripping. The intact SEI network enables the nucleation and growth of the newly plated Li inside the previously formed SEI network in the subsequent cycles, preventing additional large amount of SEI formation between newly plated Li metal and the electrolyte. In addition, cells cycled under the accurately controlled uniaxial pressure can further enhance the repeated utilization of the SEI framework and improve the coulombic efficiency (CE) by up to 97%, demonstrating an effective strategy of reducing the formation of additional SEI and inactive dead Li. The identification of such flexible and porous 3D SEI framework clarifies the working mechanism of SEI in lithium metal anode for batteries. The insights provided in this work will inspire researchers to design more functional artificial 3D SEI on other metal anodes to improve rechargeable metal battery with long cycle life.

Submitted 10 June, 2021; originally announced June 2021.

arXiv:2106.07417 [pdf, other]

doi 10.3850/978-981-18-2016-8_095-cd

Online Estimation of Resource Overload Risk in 5G Multi-Tenancy Network

Authors: Yasameen Shihab Hamad, Bin Han, Osman Nuri Ucan

Abstract: The technology of network slicing, as the most characteristic feature of the fifth generation (5G) wireless networks, manages the resources and network functions in heterogeneous and logically isolated slices on the top of a shared physical infrastructure, where every slice can be independently customized to fulfill the specific requirements of its devoted service type. It enables a new paradigm of multi-tenancy networking, where the network slices can be leased by the mobile network operator (MNO) to tenants in form of public cloud computing service, known as Slice-as-a-Service (SlaaS). Similar to classical cloud computing scenarios, SlaaS benefits from overbooking its resources to numerous tenants, taking advantage of the resource elasticity and diversity, at a price of risking overloading network resources and violating the service-level agreements (SLAs), which stipulate the quality of service (QoS) that shall be guaranteed to the network slices. Thus, it becomes a critical challenge to the MNOs, accurately estimating the resource overload risk - especially under the sophisticated network dynamics - for monitoring and enhancing the reliability of SlaaS business.

Submitted 15 June, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: To appear at ESREL 2021

Journal ref: Proceedings of the 31st European Safety and Reliability Conference, 2021

arXiv:2106.06237 [pdf, other]

KRADA: Known-region-aware Domain Alignment for Open-set Domain Adaptation in Semantic Segmentation

Authors: Chenhong Zhou, Feng Liu, Chen Gong, Rongfei Zeng, Tongliang Liu, William K. Cheung, Bo Han

Abstract: In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images -- we name it open-set domain adaptation segmentation (OSDAS). In OSDAS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSDAS, we first investigate which distribution unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each test image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning the distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.

Submitted 19 February, 2023; v1 submitted 11 June, 2021; originally announced June 2021.

Comments: 18 pages

Journal ref: Transactions on Machine Learning Research, 2023

arXiv:2104.12362 [pdf, other]

Underwater Target Recognition based on Multi-Decision LOFAR Spectrum Enhancement: A Deep Learning Approach

Authors: Jie Chen, Jie Liu, Chang Liu, Jian Zhang, Bing Han

Abstract: The low frequency analysis and recording (LOFAR) spectrum is one of the key features of the underwater target, which can be used for underwater target recognition. However, the underwater environment noise is complicated and the signal-to-noise ratio of the underwater target is rather low, which introduces the breakpoints to the LOFAR spectrum and thus hinders the underwater target recognition. To overcome this issue and to further improve the recognition performance, we adopt a deep learning approach for underwater target recognition and propose a LOFAR spectrum enhancement (LSE)-based underwater target recognition scheme, which consists of preprocessing, offline training, and online testing. In preprocessing, a LOFAR spectrum enhancement based on multi-step decision algorithm is specifically designed to recover the breakpoints in LOFAR spectrum. In offline training, we then adopt the enhanced LOFAR spectrum as the input of convolutional neural network (CNN) and develop a LOFAR-based CNN (LOFAR-CNN) for online recognition. Taking advantage of the powerful capability of CNN in feature extraction, the proposed LOFAR-CNN can further improve the recognition accuracy. Finally, extensive simulation results demonstrate that the LOFAR-CNN network can achieve a recognition accuracy of $95.22\%$, which outperforms the state-of-the-art methods.

Submitted 26 April, 2021; originally announced April 2021.

arXiv:2102.01420 [pdf, other]

doi 10.1109/OJCOMS.2021.3057679

The Road Towards 6G: A Comprehensive Survey

Authors: Wei Jiang, Bin Han, Mohammad Asif Habibi, Hans Dieter Schotten

Abstract: As of today, the fifth generation (5G) mobile communication system has been rolled out in many countries and the number of 5G subscribers already reaches a very large scale. It is time for academia and industry to shift their attention towards the next generation. At this crossroad, an overview of the current state of the art and a vision of future communications are definitely of interest. This article thus aims to provide a comprehensive survey to draw a picture of the sixth generation (6G) system in terms of drivers, use cases, usage scenarios, requirements, key performance indicators (KPIs), architecture, and enabling technologies. First, we attempt to answer the question of "Is there any need for 6G?" by shedding light on its key driving factors, in which we predict the explosive growth of mobile traffic until 2030, and envision potential use cases and usage scenarios. Second, the technical requirements of 6G are discussed and compared with those of 5G with respect to a set of KPIs in a quantitative manner. Third, the state-of-the-art 6G research efforts and activities from representative institutions and countries are summarized, and a tentative roadmap of definition, specification, standardization, and regulation is projected. Then, we identify a dozen potential technologies and introduce their principles, advantages, challenges, and open research issues. Finally, the conclusions are drawn to paint a picture of "What 6G may look like?". This survey is intended to serve as an enlightening guideline to spur interests and further investigations for subsequent research and development of 6G communications systems.

Submitted 2 February, 2021; originally announced February 2021.

Comments: 30 Pages, 5 figures, IEEE open Journal

Journal ref: IEEE Open Journal of the Communications Society (OJCOMS), Vol. 2, 2021, pp. 334 - 366

arXiv:2010.08309 [pdf]

doi 10.1049/iet-gtd.2020.0703

Partial Discharge Direction of Arrival Estimation in Air-insulated Substation by UHF Wireless Array and RSSI Maximum Likelihood Estimator

Authors: Bei Han, Lingen Luo, Gehao Sheng, Xiuchen Jiang

Abstract: The quick detection and localization of partial discharge (PD) in an air-insulated substation (AIS) based on ultrahigh-frequency (UHF) sensor arrays are efficient for power equipment monitoring. The adopted UHF PD time difference of arrival (TDOA) methods mainly use the time difference of electromagnetic wave signals. Thus, the system requires both a high sampling rate and time synchronization accuracy, leading to a high cost and large size. In this study, the feasibility and accuracy of PD DOA in an AIS were investigated using a UHF wireless sensor array and the received signal strength indicator. First, the power pattern of the designed UHF wireless sensor array was obtained via an offline experiment. Then, a statistical approach to the PD DOA method based on the maximum likelihood estimator was employed to obtain the preliminary DOA result. Finally, interpolation and clustering algorithms were used to improve the accuracy of the DOA. A laboratory test was conducted. The average error of the PD DOA was less than 6°, and the cost-effectiveness and portability were clearly improved.

Submitted 16 October, 2020; originally announced October 2020.

Comments: 8 pages, 14 figures

arXiv:2009.00197 [pdf, other]

Deep unsupervised learning for Microscopy-Based Malaria detection

Authors: Alexander Tao, Boran Han

Abstract: Malaria, a mosquito-borne disease caused by a parasite, kills over 1 million people globally each year. People, if left untreated, may develop severe complications, leading to death. Effective and accurate diagnosis is important for the management and control of malaria. Our research focuses on utilizing machine learning to improve the efficiency in malaria diagnosis. We utilize a modified U-net architecture, as an unsupervised learning model, to conduct cell boundary detection. The blood cells infected by malaria are then identified in chromatic space by a Mahalanobis distance algorithm. Both the cell segmentation and malaria detection processes often require intensive manual labeling, which we hope to eliminate via the unsupervised workflow.

Submitted 31 August, 2020; originally announced September 2020.

arXiv:2004.11536 [pdf, other]

doi 10.1016/j.apenergy.2020.114641

Leveraging inter-firm influence in the diffusion of energy efficiency technologies: An agent-based model

Authors: Yingying Shi, Yongchao Zeng, Jean Engo, Botang Han, Yang Li, Ralph T Muehleisen

Abstract: Energy efficiency technologies (EETs) are crucial for saving energy and reducing carbon dioxide emissions. However, the diffusion of EETs in small and medium-sized enterprises is rather slow. Literature shows the interactions between innovation adopters and potential adopters have significant impacts on innovation diffusion. Enterprises lack the motivation to share information, and EETs usually lack observability, which suppress the inter-firm influence. Therefore, an information platform, together with proper policies encouraging or forcing enterprises to disclose EET-related information, should help harness inter-firm influence to accelerate EETs' diffusion. To explore whether and how such an information platform affects EETs' diffusion in small and medium-sized enterprises, this study builds an agent-based model to mimic EET diffusion processes. Based on a series of controlled numerical experiments, some counter-intuitive phenomena are discovered and explained. The results show that the information platform is a double-edged sword that notably accelerates EETs' diffusion by approximately 47% but may also boost negative information to diffuse even faster and delay massive adoption of EETs. Increasing network density and the intensity of inter-firm influence are effective to speed EET diffusion, but their impacts diminish drastically after reaching some critical values (0.05 and 0.15 respectively) and eventually harm the stability of the system. Hence, the findings imply that EET suppliers should carefully launch their promising but immature products; policies that can reduce the perceived risk by enterprises and the effort to maintain an informative rather than judgmental information platform can prominently mitigate the negative side effects brought by high fluidity of information.

Submitted 24 April, 2020; originally announced April 2020.

Journal ref: Applied Energy 263 (2020) 114641

arXiv:2002.03714 [pdf, ps, other]

doi 10.1109/INFOCOMWKSHPS50562.2020.9162929

Robustness Analysis of Networked Control Systems with Aging Status

Authors: Bin Han, Siyu Yuan, Zhiyuan Jiang, Yao Zhu, Hans D. Schotten

Abstract: As an emerging metric of communication systems, Age of Information (AoI) has been derived to have a critical impact in networked control systems with unreliable information links. This work sets up a novel model of outage probability in a loosely constrained control system as a function of the feedback AoI, and conducts numerical simulations to validate the model.

Submitted 10 February, 2020; originally announced February 2020.

Comments: Submitted to IEEE INFOCOM 2020 poster session

Journal ref: IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)

Showing 1–50 of 56 results for author: Han, B