Search | arXiv e-print repository

arXiv:2405.20055 [pdf, other]

Hypergraph-Aided Task-Resource Matching for Maximizing Value of Task Completion in Collaborative IoT Systems

Abstract: With the growing scale and intrinsic heterogeneity of Internet of Things (IoT) systems, distributed device collaboration becomes essential for effective task completion by dynamically utilizing limited communication and computing resources. However, the separated design and situation-agnostic operation of computing, communication and application layers create a fundamental challenge for rapid task-resource matching, which further deteriorates the overall task completion effectiveness. To overcome this challenge, we utilize the hypergraph as a new tool to vertically unify computing, communication, and task aspects of IoT systems for an effective matching by accurately capturing the relationships between tasks and communication and computing resources. Specifically, a state-of-the-art task-resource matching hypergraph (TRM-hypergraph) model is proposed in this paper, which is used to effectively transform the process of allocating complex heterogeneous resources to convoluted tasks into a hypergraph matching problem. Taking into account computational complexity and storage, a game-theoretic hypergraph matching algorithm is proposed by treating the hypergraph matching problem as a non-cooperative multi-player clustering game. Numerical results demonstrate that the proposed TRM-hypergraph model achieves superior performance in matching tasks and resources compared with the comparison algorithms.

Submitted 30 May, 2024; originally announced May 2024.

Comments: This paper has been published in IEEE Transactions on Mobile Computing, May 2024

arXiv:2403.15700 [pdf, ps, other]

doi 10.1109/JIOT.2020.3031272

Improved Soft-k-Means Clustering Algorithm for Balancing Energy Consumption in Wireless Sensor Networks

Authors: Botao Zhu, Ebrahim Bedeer, Ha H. Nguyen, Robert Barton, Jerome Henry

Abstract: Energy load balancing is an essential issue in designing wireless sensor networks (WSNs). Clustering techniques are utilized as energy-efficient methods to balance the network energy and prolong its lifetime. In this paper, we propose an improved soft-k-means (IS-k-means) clustering algorithm to balance the energy consumption of nodes in WSNs. First, we use the idea of "clustering by fast search and find of density peaks" (CFSFDP) and kernel density estimation (KDE) to improve the selection of the initial cluster centers of the soft k-means clustering algorithm. Then, we utilize the flexibility of the soft-k-means and reassign member nodes considering their membership probabilities at the boundary of clusters to balance the number of nodes per cluster. Furthermore, the concept of multi-cluster heads is employed to balance the energy consumption within clusters. Extensive simulation results under different network scenarios demonstrate that for small-scale WSNs with single-hop transmission, the proposed algorithm can postpone the first node death, the half of nodes death, and the last node death on average when compared to various clustering algorithms from the literature.

Submitted 22 March, 2024; originally announced March 2024.

Journal ref: Published in IEEE Internet of Things Journal, 2021
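
Illustrative note on the entry above: the abstract relies on the membership probabilities of soft k-means to decide which boundary nodes to reassign. The sketch below shows only the textbook soft-k-means membership rule (a softmax over negative squared distances), not the paper's IS-k-means algorithm. The function name `soft_memberships`, the stiffness parameter `beta`, and the sample coordinates are assumptions made for illustration.

```python
import math


def soft_memberships(point, centers, beta=1.0):
    """Textbook soft-k-means memberships of one point (hypothetical helper).

    Each membership is proportional to exp(-beta * squared_distance), and
    the memberships over all centers sum to 1.
    """
    d2 = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(d2)
    weights = [math.exp(-beta * (d - d_min)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up example: two cluster centers on a line.
centers = [(0.0, 0.0), (4.0, 0.0)]
boundary = soft_memberships((2.0, 0.0), centers)    # equidistant node
near_first = soft_memberships((0.5, 0.0), centers)  # node close to center 0
```

A node that sits midway between two centers gets roughly equal memberships, which is the kind of node a balancing step could move to the smaller cluster; a node near one center belongs to it almost exclusively.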

arXiv:2403.06423 [pdf, other]

LiDAR Point Cloud-based Multiple Vehicle Tracking with Probabilistic Measurement-Region Association

Authors: Guanhua Ding, Jianan Liu, Yuxuan Xia, Tao Huang, Bing Zhu, **** Sun

Abstract: Multiple extended target tracking (ETT) has gained increasing attention due to the development of high-precision LiDAR and radar sensors in automotive applications. For LiDAR point cloud-based vehicle tracking, this paper presents a probabilistic measurement-region association (PMRA) ETT model, which can describe the complex measurement distribution by partitioning the target extent into different regions. The PMRA model overcomes the drawbacks of previous data-region association (DRA) models by eliminating the approximation error of constrained estimation and using continuous integrals to more reliably calculate the association probabilities. Furthermore, the PMRA model is integrated with the Poisson multi-Bernoulli mixture (PMBM) filter for tracking multiple vehicles. Simulation results illustrate the superior estimation accuracy of the proposed PMRA-PMBM filter in terms of both positions and extents of the vehicles compared with PMBM filters using the gamma Gaussian inverse Wishart and DRA implementations.

Submitted 18 May, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: 8 pages, 5 figures, accepted by the 27th International Conference on Information Fusion (FUSION 2024)

arXiv:2402.17785 [pdf, other]

ByteComposer: a Human-like Melody Composition Method based on Language Model Agent

Authors: Xia Liang, Xingjian Du, Jiaju Lin, Pei Zou, Yuan Wan, Bilei Zhu

Abstract: Large Language Models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system is still under-explored. To solve this problem, we propose ByteComposer, an agent framework emulating a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". This framework seamlessly blends the interactive and knowledge-understanding features of LLMs with existing symbolic music generation models, thereby achieving a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source large language models, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations; the final results demonstrated that, across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.

Submitted 6 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.07485 [pdf, other]

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

Authors: Hang Zhao, Yifei Xin, Zhesong Yu, Bilei Zhu, Lu Lu, Zejun Ma

Abstract: In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model's ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.

Submitted 11 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2310.18767 [pdf, other]

Enhancing Epileptic Seizure Detection with EEG Feature Embeddings

Authors: Arman Zarei, Bingzhao Zhu, Mahsa Shoaran

Abstract: Epilepsy is one of the most prevalent brain disorders that disrupts the lives of millions worldwide. For patients with drug-resistant seizures, there exist implantable devices capable of monitoring neural activity, promptly triggering neurostimulation to regulate seizures, or alerting patients of potential episodes. Next-generation seizure detection systems heavily rely on high-accuracy machine learning-based classifiers to detect the seizure onset. Here, we propose to enhance the seizure detection performance by learning informative embeddings of the EEG signal. We empirically demonstrate, for the first time, that converting raw EEG signals to appropriate embeddings can significantly boost the performance of seizure detection algorithms. Importantly, we show that embedding features, which converts the raw EEG into an alternative representation, is beneficial for various machine learning models such as Logistic Regression, Multi-Layer Perceptron, Support Vector Machines, and Gradient Boosted Trees. The experiments were conducted on the CHB-MIT scalp EEG dataset. With the proposed EEG feature embeddings, we achieve significant improvements in sensitivity, specificity, and AUC score across multiple models. By employing this approach alongside an SVM classifier, we were able to attain state-of-the-art classification performance with a sensitivity of 100% and specificity of 99%, setting a new benchmark in the field.

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.10159 [pdf, other]

Joint Music and Language Attention Models for Zero-shot Music Tagging

Authors: Xingjian Du, Zhesong Yu, Jiaju Lin, Bilei Zhu, Qiuqiang Kong

Abstract: Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks, which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by a Falcon7B. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between the encoder and decoder layers. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of $64.82\%$ on the GTZAN dataset, outperforming previous zero-shot systems, and achieves comparable results to previous systems on the FMA and the MagnaTagATune datasets.

Submitted 16 October, 2023; originally announced October 2023.

Comments: Keywords: Music tagging, joint music and language attention models, Music Foundation Model.

arXiv:2309.06036 [pdf, other]

Which Framework is Suitable for Online 3D Multi-Object Tracking for Autonomous Driving with Automotive 4D Imaging Radar?

Authors: Jianan Liu, Guanhua Ding, Yuxuan Xia, **** Sun, Tao Huang, Lihua Xie, Bing Zhu

Abstract: Online 3D multi-object tracking (MOT) has recently received significant research interest due to the expanding demand for 3D perception in advanced driver assistance systems (ADAS) and autonomous driving (AD). Among the existing 3D MOT frameworks for ADAS and AD, the conventional point object tracking (POT) framework using the tracking-by-detection (TBD) strategy has been well studied and accepted for LiDAR and 4D imaging radar point clouds. In contrast, extended object tracking (EOT), another important framework which accepts the joint-detection-and-tracking (JDT) strategy, has rarely been explored for online 3D MOT applications. This paper provides the first systematic investigation of the EOT framework for online 3D MOT in real-world ADAS and AD scenarios. Specifically, the widely accepted TBD-POT framework, the recently investigated JDT-EOT framework, and our proposed TBD-EOT framework are compared via extensive evaluations on two open source 4D imaging radar datasets: View-of-Delft and TJ4DRadSet. Experiment results demonstrate that the conventional TBD-POT framework remains preferable for online 3D MOT with high tracking performance and low computational complexity, while the proposed TBD-EOT framework has the potential to outperform it in certain situations. However, the results also show that the JDT-EOT framework encounters multiple problems and performs inadequately in evaluation scenarios. After analyzing the causes of these phenomena based on various evaluation metrics and visualizations, we provide possible guidelines to improve the performance of these MOT frameworks on real-world data. These provide the first benchmark and important insights for the future development of 4D imaging radar-based online 3D MOT algorithms.

Submitted 25 May, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

Comments: 8 pages, 5 figures, accepted by IEEE 35th Intelligent Vehicles Symposium (IV 2024), oral presentation (top 5%), code is available at https://github.com/dinggh0817/4D_Radar_MOT

arXiv:2306.04970 [pdf, other]

Motion Planning for Aerial Pick-and-Place based on Geometric Feasibility Constraints

Authors: Huazi Cao, Jiahao Shen, Cunjia Liu, Bo Zhu, Shiyu Zhao

Abstract: This paper studies the motion planning problem of the pick-and-place of an aerial manipulator that consists of a quadcopter flying base and a Delta arm. We propose a novel partially decoupled motion planning framework to solve this problem. Compared to the state-of-the-art approaches, the proposed one has two novel features. First, it does not suffer from increased computation in high-dimensional configuration spaces. That is because it calculates the trajectories of the quadcopter base and the end-effector separately in the Cartesian space based on proposed geometric feasibility constraints. The geometric feasibility constraints can ensure the resulting trajectories satisfy the aerial manipulator's geometry. Second, collision avoidance for the Delta arm is achieved through an iterative approach based on a pinhole mapping method, so that the feasible trajectory can be found in an efficient manner. The proposed approach is verified by three experiments on a real aerial manipulation platform. The experimental results show the effectiveness of the proposed method for the aerial pick-and-place task.

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.02231 [pdf, other]

Fine-Tuning Language Models with Advantage-Induced Policy Alignment

Authors: Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin, when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.

Submitted 2 November, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.02003 [pdf, other]

On Optimal Caching and Model Multiplexing for Large Model Inference

Authors: Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao

Abstract: Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1.8\times$ improvement in latency when the ratio for average latency is $1.85$.

Submitted 28 August, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2306.00265 [pdf, other]

Doubly Robust Self-Training

Authors: Banghua Zhu, Mingyu Ding, Philip Jacobson, Ming Wu, Wei Zhan, Michael Jordan, Jiantao Jiao

Abstract: Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.

Submitted 2 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

arXiv:2305.08247 [pdf]

A Fast and Robust Camera-IMU Online Calibration Method For Localization System

Authors: Xiaowen Tao, Pengxiang Meng, Bing Zhu, Jian Zhao

Abstract: Autonomous driving has spurred the development of sensor fusion techniques, which combine data from multiple sensors to improve system performance. In particular, a localization system based on sensor fusion, such as Visual Simultaneous Localization and Mapping (VSLAM), is an important component in environment perception, and is the basis of decision-making and motion control for intelligent vehicles. The accuracy of the extrinsic calibration parameters between camera and IMU has a significant effect on the positioning precision when running a VSLAM system. Currently, existing methods are time-consuming because they use complex optimization methods, and they are sensitive to noise and outliers due to off-calibration, which can negatively impact system performance. To address these problems, this paper presents a fast and robust camera-IMU online calibration method based on space coordinate transformation constraints and SVD (Singular Value Decomposition) tricks. First, constraint equations are constructed based on the equality of rotation and transformation matrices between camera frames and IMU coordinates at different moments. Second, the external parameters of the camera-IMU are solved using quaternion transformation and SVD techniques. Finally, the proposed method is validated using the ROS platform, where images from the camera and velocity, acceleration, and angular velocity data from the IMU are recorded in a ROS bag file. The results showed that the proposed method can achieve robust and reliable camera-IMU online calibration results with less time consumption and less uncertainty.

Submitted 14 May, 2023; originally announced May 2023.

arXiv:2305.07618 [pdf]

Uncertainty Estimation and Out-of-Distribution Detection for Deep Learning-Based Image Reconstruction using the Local Lipschitz

Authors: Danyal F. Bhutto, Bo Zhu, Jeremiah Z. Liu, Neha Koonjoo, Hongwei B. Li, Bruce R. Rosen, Matthew S. Rosen

Abstract: Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning-based approaches have been investigated for solving inverse problems including image reconstruction. However, these trained models encounter unseen data distributions that are widely shifted from training data during deployment. Therefore, it is essential to assess whether a given input falls within the training data distribution for diagnostic purposes. Uncertainty estimation approaches exist but focus on providing an uncertainty map to radiologists, rather than assessing the training distribution fit. In this work, we propose a method based on the local Lipschitz-based metric to distinguish out-of-distribution images from in-distribution with an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and mean absolute error (MAE), supported by a high Spearman's rank correlation coefficient of 0.8475, which determines the uncertainty estimation threshold for optimal model performance. Through the identification of false positives, the local Lipschitz and MAE relationship was used to guide data augmentation and reduce model uncertainty. Our study was validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our proposed approach with baseline methods: Monte-Carlo dropout and deep ensembles, and further analysis included MRI denoising and Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in the realm of medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images remains paramount.

Submitted 1 December, 2023; v1 submitted 12 May, 2023; originally announced May 2023.
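
Illustrative note on the entry above: the abstract reports a Spearman's rank correlation coefficient between the local Lipschitz value and the mean absolute error. As a reminder of how such a coefficient is obtained, the sketch below computes it as the Pearson correlation of the ranks, with ties given average ranks. The helper names and the five sample pairs are made up for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values standing in for per-image Lipschitz estimates and MAE.
lipschitz = [0.1, 0.4, 0.2, 0.9, 0.5]
mae = [0.02, 0.05, 0.03, 0.20, 0.04]
rho = spearman(lipschitz, mae)
```

Because only ranks enter the computation, any strictly increasing relationship between the two quantities gives a coefficient of 1, whether or not it is linear.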

arXiv:2303.11692 [pdf, other]

ByteCover3: Accurate Cover Song Identification on Short Queries

Authors: Xingjian Du, Zijie Wang, Xia Liang, Huidong Liang, Bilei Zhu, Zejun Ma

Abstract: Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream datasets of CSI. However, with the burgeoning of short videos, many real-world applications require matching short music excerpts to full-length music tracks in the database, which is still under-explored and waiting for an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which utilizes local features to further improve the identification performance of short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI in a more precise and efficient way. We evaluated ByteCover3 on multiple datasets with different benchmark settings, where ByteCover3 beat all the compared methods including its previous versions.

Submitted 21 March, 2023; originally announced March 2023.

Comments: Accepted by ICASSP 2023

arXiv:2301.06784 [pdf, other]

On the Statistical Consistency of a Generalized Cepstral Estimator

Authors: Bin Zhu, Mattia Zorzi

Abstract: We consider the problem of estimating the generalized cepstral coefficients of a stationary stochastic process or stationary multidimensional random field. It turns out that a naive version of the periodogram-based estimator for the generalized cepstral coefficients is not consistent. We propose a consistent estimator for those coefficients. Moreover, we show that the latter can be used in order to build a consistent estimator for a particular class of cascade linear stochastic systems.

Submitted 17 January, 2023; originally announced January 2023.

Comments: 11 pages in IEEE Transactions template, 4 figures. Submitted to IEEE Transactions on Automatic Control

arXiv:2212.05301 [pdf, other]

Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning

Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, Eng Siong Chng

Abstract: Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages the MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.

Submitted 2 February, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

Comments: Accepted by AAAI2023

arXiv:2212.03540 [pdf, other]

EASpace: Enhanced Action Space for Policy Transfer

Authors: Zheng Zhang, Qingrui Zhang, Bo Zhu, Xiaohan Wang, Tianjiang Hu

Abstract: Formulating expert policies as macro actions promises to alleviate the long-horizon issue via structured exploration and efficient credit assignment. However, traditional option-based multi-policy transfer methods suffer from inefficient exploration of macro action's length and insufficient exploitation of useful long-duration macro actions. In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies. Specifically, EASpace formulates each expert policy into multiple macro actions with different execution times. All the macro actions are then integrated into the primitive action space directly. An intrinsic reward, which is proportional to the execution time of macro actions, is introduced to encourage the exploitation of useful macro actions. The corresponding learning rule that is similar to Intra-option Q-learning is employed to improve the data efficiency. Theoretical analysis is presented to show the convergence of the proposed learning rule. The efficiency of EASpace is illustrated by a grid-based game and a multi-agent pursuit problem. The proposed algorithm is also implemented in physical systems to validate its effectiveness.

Submitted 24 July, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: 15 Pages

arXiv:2211.06175 [pdf, other]

doi 10.1002/rnc.6962

Control Lyapunov-Barrier Function Based Model Predictive Control for Stochastic Nonlinear Affine Systems

Authors: Weijiang Zheng, Bing Zhu

Abstract: A stochastic model predictive control (MPC) framework is presented in this paper for nonlinear affine systems with stability and feasibility guarantees. We first introduce the concept of stochastic control Lyapunov-barrier function (CLBF) and provide a method to construct CLBF by combining an unconstrained control Lyapunov function (CLF) and control barrier functions. The unconstrained CLF is obtained from its corresponding semi-linear system through dynamic feedback linearization. Based on the constructed CLBF, we utilize a sampled-data MPC framework to deal with state and input constraints, and to analyze stability of closed-loop systems. Moreover, event-triggering mechanisms are integrated into the MPC framework to improve performance during sampling intervals. The proposed CLBF based stochastic MPC is validated via an obstacle avoidance example.

Submitted 26 June, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

Comments: 21 pages, 6 figures

Journal ref: International Journal of Robust and Nonlinear Control, 2024

arXiv:2208.14372 [pdf, ps, other]

Dead-beat model predictive control for discrete-time linear systems

Authors: Bing Zhu

Abstract: In this paper, model predictive control (MPC) strategies are proposed for dead-beat control of linear systems with and without state and control constraints. In unconstrained MPC, deadbeat performance can be guaranteed by setting the control horizon to the system dimension, and adding a terminal equality constraint. It is proved that the unconstrained deadbeat MPC is equivalent to linear deadbeat control. The proposed constrained deadbeat MPC is designed by setting the control horizon equal to the system dimension and penalizing only the terminal cost. The recursive feasibility and deadbeat performance are proved theoretically.

Submitted 30 August, 2022; originally announced August 2022.

arXiv:2208.10059 [pdf, ps, other]

Sampling Gaussian Stationary Random Fields: A Stochastic Realization Approach

Authors: Bin Zhu, Jiahao Liu, Zhengshou Lai, Tao Qian

Abstract: Generating large-scale samples of stationary random fields is of great importance in fields such as geomaterial modeling and uncertainty quantification. Traditional methodologies based on covariance matrix decomposition have the difficulty of being computationally expensive, which is even more serious when the dimension of the random field is large. This paper proposes an efficient stochastic realization approach for sampling Gaussian stationary random fields from a systems and control point of view. Specifically, we take the exponential and Gaussian covariance functions as examples and make a decoupling assumption when there are multiple dimensions. Then a rational spectral density is constructed in each dimension using techniques from covariance extension, and the corresponding autoregressive moving-average (ARMA) model is obtained via spectral factorization. As a result, samples of the random field with a specific covariance function can be generated very efficiently in the space domain by implementing the ARMA recursion using a white noise input. Such a procedure is computationally cheap due to the fact that the constructed ARMA model has a low order. Furthermore, the same method is integrated to multiscale simulations where interpolations of the generated samples are achieved when one zooms into finer scales. Both theoretical analysis and simulation results show that our approach performs favorably compared with covariance matrix decomposition methods.

Submitted 22 August, 2022; originally announced August 2022.

Comments: 17 pages, 9 figures

arXiv:2206.10255 [pdf, other]

GNN-PMB: A Simple but Effective Online 3D Multi-Object Tracker without Bells and Whistles

Authors: Jianan Liu, Liping Bai, Yuxuan Xia, Tao Huang, Bing Zhu, Qing-Long Han

Abstract: Multi-object tracking (MOT) is among the crucial applications in modern advanced driver assistance systems (ADAS) and autonomous driving (AD) systems. The global nearest neighbor (GNN) filter, as the earliest random vector-based Bayesian tracking framework, has been adopted in most state-of-the-art trackers in the automotive industry. The development of random finite set (RFS) theory facilitates a mathematically rigorous treatment of the MOT problem, and different variants of RFS-based Bayesian filters have then been proposed. However, their effectiveness in real ADAS and AD applications is still an open problem. In this paper, it is demonstrated that the latest RFS-based Bayesian tracking framework could be superior to the typical random vector-based Bayesian tracking framework via a systematic comparative study of both traditional random vector-based Bayesian filters with rule-based heuristic track maintenance and RFS-based Bayesian filters on the nuScenes validation dataset. An RFS-based tracker, namely the Poisson multi-Bernoulli filter using the global nearest neighbor (GNN-PMB), is proposed for LiDAR-based MOT tasks. This GNN-PMB tracker is simple to use, and it achieves competitive results on the nuScenes dataset. Specifically, the proposed GNN-PMB tracker outperforms most state-of-the-art LiDAR-only trackers and LiDAR and camera fusion-based trackers, ranking 3rd among all LiDAR-only trackers on the nuScenes 3D tracking challenge leaderboard at the time of submission.

Submitted 8 February, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: accepted by IEEE Transactions on Intelligent Vehicles

arXiv:2205.08143 [pdf, other]

doi 10.1016/j.ultrasmedbio.2023.11.009

Brachial Plexus Nerve Trunk Segmentation Using Deep Learning: A Comparative Study with Doctors' Manual Segmentation

Authors: Yu Wang, Binbin Zhu, Lingsi Kong, Jianlin Wang, Bin Gao, Jianhua Wang, Dingcheng Tian, Yudong Yao

Abstract: Ultrasound-guided nerve block anesthesia (UGNB) is a high-tech visual nerve block anesthesia method that can observe the target nerve and its surrounding structures, the puncture needle's advancement, and local anesthetics spread in real-time. The key in UGNB is nerve identification. With the help of deep learning methods, the automatic identification or segmentation of nerves can be realized, assisting doctors in completing nerve block anesthesia accurately and efficiently. Here, we establish a public dataset containing 320 ultrasound images of brachial plexus (BP). Three experienced doctors jointly produce the BP segmentation ground truth and label brachial plexus trunks. We design a brachial plexus segmentation system (BPSegSys) based on deep learning. BPSegSys achieves experienced-doctor-level nerve identification performance in various experiments. We evaluate BPSegSys' performance in terms of intersection-over-union (IoU), a commonly used performance measure for segmentation experiments. Considering three dataset groups in our established public dataset, the IoU of BPSegSys are 0.5238, 0.4715, and 0.5029, respectively, which exceed the IoU 0.5205, 0.4704, and 0.4979 of experienced doctors. In addition, we show that BPSegSys can help doctors identify brachial plexus trunks more accurately, with IoU improvement up to 27%, which has significant clinical application value.

Submitted 17 May, 2022; originally announced May 2022.

Comments: 9 pages

Journal ref: Ultrasound in Medicine & Biology, 2024, 50(3): 374-383

arXiv:2205.06090 [pdf, other]

doi 10.1109/JSSC.2022.3204508

NeuralTree: A 256-Channel 0.227-$μ$J/Class Versatile Neural Activity Classification and Closed-Loop Neuromodulation SoC

Authors: Uisub Shin, Cong Ding, Bingzhao Zhu, Yashwanth Vyza, Alix Trouillet, Emilie C. M. Revol, Stéphanie P. Lacour, Mahsa Shoaran

Abstract: Closed-loop neural interfaces with on-chip machine learning can detect and suppress disease symptoms in neurological disorders or restore lost functions in paralyzed patients. While high-density neural recording can provide rich neural activity information for accurate disease-state detection, existing systems have low channel counts and poor scalability, which could limit their therapeutic efficacy. This work presents a highly scalable and versatile closed-loop neural interface SoC that can overcome these limitations. A 256-channel time-division multiplexed (TDM) front-end with a two-step fast-settling mixed-signal DC servo loop (DSL) is proposed to record high-spatial-resolution neural activity and perform channel-selective brain-state inference. A tree-structured neural network (NeuralTree) classification processor extracts a rich set of neural biomarkers in a patient- and disease-specific manner. Trained with an energy-aware learning algorithm, the NeuralTree classifier detects the symptoms of underlying disorders (e.g., epilepsy and movement disorders) at an optimal energy-accuracy tradeoff. A 16-channel high-voltage (HV) compliant neurostimulator closes the therapeutic loop by delivering charge-balanced biphasic current pulses to the brain. The proposed SoC was fabricated in 65-nm CMOS and achieved a 0.227-$μ$J/class energy efficiency in a compact area of 0.014mm$^2$/channel. The SoC was extensively verified on human electroencephalography (EEG) and intracranial EEG (iEEG) epilepsy datasets, obtaining 95.6%/94% sensitivity and 96.8%/96.9% specificity, respectively. In vivo neural recordings using soft $μ$ECoG arrays and multi-domain biomarker extraction were further performed on a rat model of epilepsy. In addition, for the first time in literature, on-chip classification of rest-state tremor in Parkinson's disease (PD) from human local field potentials (LFPs) was demonstrated.

Submitted 8 December, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Journal ref: IEEE Journal of Solid-State Circuits, vol. 57, no. 11, pp. 3243-3257, Nov. 2022

arXiv:2205.02999 [pdf, ps, other]

Fast and Arbitrary Beam Pattern Design for RIS-Assisted Terahertz Wireless Communication

Authors: Jian Dang, Zaichen Zhang, Yewei Li, Liang Wu, Bingcheng Zhu, Lei Wang

Abstract: Reconfigurable intelligent surface (RIS) can assist terahertz wireless communication to restore the fragile line-of-sight links and facilitate beam steering. Arbitrary reflection beam patterns are desired to meet diverse requirements in different applications. This paper establishes a relationship between RIS beam pattern design and two-dimensional finite impulse response filter design and proposes a fast non-iterative algorithm to solve the problem. Simulations show that the proposed method outperforms the baseline method. Hence, it represents a promising solution for fast and arbitrary beam pattern design in RIS-assisted terahertz wireless communication.

Submitted 5 May, 2022; originally announced May 2022.

Comments: 5 pages, 5 figures

arXiv:2204.14057 [pdf, other]

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

Authors: Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng

Abstract: We present an approach to learn voice-face representations from the talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which leads to deviating positives into the contrastive paradigm. To address these issues, we propose the cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists adverse effects of false negatives and deviate positives. On one hand, CMPC could learn the intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, by comparing the similarities of cross-modal instances from that of cross-modal prototypes, we dynamically recalibrate the unlearnable instances' contribution to overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also has a significant improvement compared to previous instance-wise contrastive learning.

Submitted 26 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: 8 pages, 4 figures. Accepted by IJCAI-2022

arXiv:2203.03863 [pdf, ps, other]

Amplitude-Constrained Constellation and Reflection Pattern Designs for Directional Backscatter Communications Using Programmable Metasurface

Authors: Wei Wang, Bincheng Zhu, Yongming Huang, Wei Zhang

Abstract: The large scale reflector array of programmable metasurfaces is capable of increasing the power efficiency of backscatter communications via passive beamforming and thus has the potential to revolutionize the low-data-rate nature of backscatter communications. In this paper, we propose to design the power-efficient higher-order constellation and reflection pattern under the amplitude constraint brought by backscatter communications. For the constellation design, we adopt the amplitude and phase-shift keying (APSK) constellation and optimize the parameters of APSK such as ring number, ring radius, and inter-ring phase difference. Specifically, we derive closed-form solutions to the optimal ring radius and inter-ring phase difference for an arbitrary modulation order in the decomposed subproblems. For the reflection pattern design, we propose to optimize the passive beamforming vector by solving a multi-objective optimization problem that maximizes reflection power and guarantees beam homogenization within the interested angle range. To solve the problem, we propose a constant-modulus power iteration method, which is proven to be monotonically increasing, to maximize the objective function in each iteration. Numerical results show that the proposed APSK constellation design and reflection pattern design outperform the existing modulation and beam pattern designs in programmable metasurface enabled backscatter communications.

Submitted 30 March, 2023; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: Accepted in IEEE Transactions on Wireless Communications

arXiv:2202.10139 [pdf, other]

S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification

Authors: Hang Zhao, Chen Zhang, Belei Zhu, Zejun Ma, Kejun Zhang

Abstract: In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive easily accessible unlabeled music data. S3T introduces a momentum-based paradigm, MoCo, with Swin Transformer as its feature extractor to the music time-frequency domain. For better music representation learning, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percent top-1 accuracy and 4.8 percent PR-AUC on two tasks respectively, and also surpasses the task-specific state-of-the-art supervised methods. Besides, S3T shows advances in label efficiency using only 10% labeled data exceeding CLMR on both tasks with 100% labeled data.

Submitted 21 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP2022

arXiv:2202.05267 [pdf, other]

On Real-time Image Reconstruction with Neural Networks for MRI-guided Radiotherapy

Authors: David E. J. Waddington, Nicholas Hindley, Neha Koonjoo, Christopher Chiu, Tess Reynolds, Paul Z. Y. Liu, Bo Zhu, Danyal Bhutto, Chiara Paganelli, Paul J. Keall, Matthew S. Rosen

Abstract: MRI-guidance techniques that dynamically adapt radiation beams to follow tumor motion in real-time will lead to more accurate cancer treatments and reduced collateral healthy tissue damage. The gold-standard for reconstruction of undersampled MR data is compressed sensing (CS) which is computationally slow and limits the rate that images can be available for real-time adaptation. Here, we demonstrate the use of automated transform by manifold approximation (AUTOMAP), a generalized framework that maps raw MR signal to the target image domain, to rapidly reconstruct images from undersampled radial k-space data. The AUTOMAP neural network was trained to reconstruct images from a golden-angle radial acquisition, a benchmark for motion-sensitive imaging, on lung cancer patient data and generic images from ImageNet. Model training was subsequently augmented with motion-encoded k-space data derived from videos in the YouTube-8M dataset to encourage motion robust reconstruction. We find that AUTOMAP-reconstructed radial k-space has equivalent accuracy to CS but with much shorter processing times after initial fine-tuning on retrospectively acquired lung cancer patient data. Validation of motion-trained models with a virtual dynamic lung tumor phantom showed that the generalized motion properties learned from YouTube lead to improved target tracking accuracy. Our work shows that AUTOMAP can achieve real-time, accurate reconstruction of radial data. These findings imply that neural-network-based reconstruction is potentially superior to existing approaches for real-time image guidance applications.

Submitted 18 May, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: 12 pages, 6 figures, 1 table. v2 has a typo in eqn 1 corrected and references added to the discussion

arXiv:2202.01269 [pdf, ps, other]

Robust Estimation for Nonparametric Families via Generative Adversarial Networks

Authors: Banghua Zhu, Jiantao Jiao, Michael I. Jordan

Abstract: We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating unknown parameters of the true distribution given adversarially corrupted samples. Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem. Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.

Submitted 2 February, 2022; originally announced February 2022.

arXiv:2202.00874 [pdf, other]

HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% model parameters and 15% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

Submitted 1 February, 2022; originally announced February 2022.

Comments: Preprint version for ICASSP 2022, Singapore

arXiv:2201.08563 [pdf, other]

Performance Analysis of Hybrid RF-Reconfigurable Intelligent Surfaces Assisted FSO Communication

Authors: Haibo Wang, Zaichen Zhang, Bingcheng Zhu, Yidi Zhang

Abstract: Optical reconfigurable intelligent surface (ORIS) is an emerging technology that can achieve reconfigurable optical propagation environments by precisely adjusting the signal's reflection and shape through a large number of passive reflecting elements. In this paper, we investigate the performance of an ORIS-assisted dual-hop hybrid radio frequency (RF) and free space optics (FSO) communication system. By jointly considering the physical models of ORIS, RF channel, atmospheric turbulence, and pointing error, the closed-form solutions of the system's precise outage probability, asymptotic outage probability and BER have been derived. It is shown through numerical results that the derived results are accurate and the RF-FSO links with ORISs show a slightly worse performance than the traditional RF-FSO links. Based on theoretical analysis and simulation results, the system design and effect of each parameter have been discussed.

Submitted 21 January, 2022; originally announced January 2022.

arXiv:2112.07891 [pdf, other]

Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data

Authors: Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov

Abstract: Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.

Submitted 12 February, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Preprint version for Association for the Advancement of Artificial Intelligence Conference, AAAI 2022

arXiv:2112.00333 [pdf, ps, other]

Joint Cluster Head Selection and Trajectory Planning in UAV-Aided IoT Networks by Reinforcement Learning with Sequential Model

Authors: Botao Zhu, Ebrahim Bedeer, Ha H. Nguyen, Robert Barton, Jerome Henry

Abstract: Employing unmanned aerial vehicles (UAVs) has attracted growing interest and emerged as the state-of-the-art technology for data collection in Internet-of-Things (IoT) networks. In this paper, with the objective of minimizing the total energy consumption of the UAV-IoT system, we formulate the problem of jointly designing the UAV's trajectory and selecting cluster heads in the IoT network as a constrained combinatorial optimization problem which is classified as NP-hard and challenging to solve. We propose a novel deep reinforcement learning (DRL) with a sequential model strategy that can effectively learn the policy represented by a sequence-to-sequence neural network for the UAV's trajectory design in an unsupervised manner. Through extensive simulations, the obtained results show that the proposed DRL method can find the UAV's trajectory that requires much less energy consumption when compared to other baseline algorithms and achieves close-to-optimal performance. In addition, simulation results show that the model trained by our proposed DRL algorithm has an excellent generalization ability to larger problem sizes without the need to retrain the model.

Submitted 1 December, 2021; originally announced December 2021.

Comments: This paper has been accepted in IEEE IoT-J

arXiv:2110.01775 [pdf, other]

doi 10.1109/TIV.2022.3168899

Deep Instance Segmentation with Automotive Radar Detection Points

Authors: Jianan Liu, Weiyi Xiong, Li** Bai, Yuxuan Xia, Tao Huang, Wanli Ouyang, Bing Zhu

Abstract: Automotive radar provides reliable environmental perception in all-weather conditions with affordable cost, but it hardly supplies semantic and geometry information due to the sparsity of radar detection points. With the development of automotive radar technologies in recent years, instance segmentation becomes possible by using automotive radar. Its data contain contexts such as radar cross section and micro-Doppler effects, and sometimes can provide detection when the field of view is obscured. The outcome from instance segmentation could be potentially used as the input of trackers for tracking targets. The existing methods often utilize a clustering-based classification framework, which fits the need of real-time processing but has limited performance due to minimum information provided by sparse radar detection points. In this paper, we propose an efficient method based on clustering of estimated semantic information to achieve instance segmentation for the sparse radar detection points. In addition, we show that the performance of the proposed approach can be further enhanced by incorporating the visual multi-layer perceptron. The effectiveness of the proposed method is verified by experimental results on the popular RadarScenes dataset, achieving 89.53% mean coverage and 86.97% mean average precision with the IoU threshold of 0.5, which is superior to other approaches in the literature. More significantly, the consumed memory is around 1MB, and the inference time is less than 40ms, indicating that our proposed algorithm is storage and time efficient. These two criteria ensure the practicality of the proposed method in real-world systems.

Submitted 5 February, 2023; v1 submitted 4 October, 2021; originally announced October 2021.

Comments: 11 pages, 9 figures, 3 tables, accepted by IEEE Transactions on Intelligent Vehicles

arXiv:2109.14926 [pdf, other]

A Fast Robust Numerical Continuation Solver to a Two-Dimensional Spectral Estimation Problem

Authors: Bin Zhu, Jiahao Liu

Abstract: This paper presents a fast algorithm to solve a spectral estimation problem for two-dimensional random fields. The latter is formulated as a convex optimization problem with the Itakura-Saito pseudodistance as the objective function subject to the constraints of moment equations. We exploit the structure of the Hessian of the dual objective function in order to make possible a fast Newton solver. Then we incorporate the Newton solver into a predictor-corrector numerical continuation method which is able to produce a parametrized family of solutions to the moment equations. We have performed two sets of numerical simulations to test our algorithm and spectral estimator. The simulations on the frequency estimation problem show that our spectral estimator outperforms the classical windowed periodograms in the case of two hidden frequencies and has a higher resolution. The other set of simulations on system identification indicates that the numerical continuation method is more robust than Newton's method alone in ill-conditioned instances.

Submitted 30 September, 2021; originally announced September 2021.

Comments: 13 pages, 8 figures

arXiv:2109.05848 [pdf, other]

Closed-Loop Neural Prostheses with On-Chip Intelligence: A Review and A Low-Latency Machine Learning Model for Brain State Detection

Authors: Bingzhao Zhu, Uisub Shin, Mahsa Shoaran

Abstract: The application of closed-loop approaches in systems neuroscience and therapeutic stimulation holds great promise for revolutionizing our understanding of the brain and for developing novel neuromodulation therapies to restore lost functions. Neural prostheses capable of multi-channel neural recording, on-site signal processing, rapid symptom detection, and closed-loop stimulation are critical to enabling such novel treatments. However, the existing closed-loop neuromodulation devices are too simplistic and lack sufficient on-chip processing and intelligence. In this paper, we first discuss both commercial and investigational closed-loop neuromodulation devices for brain disorders. Next, we review state-of-the-art neural prostheses with on-chip machine learning, focusing on application-specific integrated circuits (ASIC). System requirements, performance and hardware comparisons, design trade-offs, and hardware optimization techniques are discussed. To facilitate a fair comparison and guide design choices among various on-chip classifiers, we propose a new energy-area (E-A) efficiency figure of merit that evaluates hardware efficiency and multi-channel scalability. Finally, we present several techniques to improve the key design metrics of tree-based on-chip classifiers, both in the context of ensemble methods and oblique structures.

Submitted 13 September, 2021; originally announced September 2021.

arXiv:2109.03990 [pdf, other]

A Novel Method to Estimate the Coordinates of LEDs in Wireless Optical Positioning Systems

Authors: Kehan Zhang, Zaichen Zhang, Bingcheng Zhu

Abstract: Traditional visible light positioning (VLP) systems estimate receivers' coordinates based on the known light-emitting diode (LED) coordinates. However, the LED coordinates are not always known accurately. Because of structural changes of the buildings due to temperature, humidity or material aging, the LED coordinates may change unpredictably even when measured by highly accurate laser range finders. In this paper, we propose an easy and low-cost method to update the position information of the LEDs. We use two optical angle-of-arrival (AOA) estimators to detect the beam directions of the LEDs. Each AOA estimator has four differently oriented photodiodes (PDs). Considering the additive noises of the PDs, we derive the closed-form error expression for the proposed LED coordinates estimator. Both analytical and Monte Carlo experimental results show that the layout of the AOA estimators could affect the estimation error. These results may provide intuitive insights for the design of the optical indoor positioning systems.

Submitted 8 September, 2021; originally announced September 2021.

Comments: 5 pages, 4 figures, conference

arXiv:2109.00354 [pdf, ps, other]

Outage Analysis and Beamwidth Optimization for Positioning-Assisted Beamforming

Authors: Bingcheng Zhu, Zaichen Zhang, Julian Cheng

Abstract: Conventional beamforming is based on channel estimation, which can be computationally intensive and inaccurate when the antenna array is large. In this work, we study the outage probability of positioning-assisted beamforming systems. Closed-form outage probability bounds are derived by considering positioning error, link distance and beamwidth. Based on the analytical result, we show that the beamwidth should be optimized with respect to the link distance and the transmit power, and such optimization significantly suppresses the outage probability.

Submitted 9 April, 2022; v1 submitted 1 September, 2021; originally announced September 2021.

arXiv:2108.00354 [pdf, ps, other]

UAV Trajectory Planning in Wireless Sensor Networks for Energy Consumption Minimization by Deep Reinforcement Learning

Authors: Botao Zhu, Ebrahim Bedeer, Ha H. Nguyen, Robert Barton, Jerome Henry

Abstract: Unmanned aerial vehicles (UAVs) have emerged as a promising candidate solution for data collection of large-scale wireless sensor networks (WSNs). In this paper, we investigate a UAV-aided WSN, where cluster heads (CHs) receive data from their member nodes, and a UAV is dispatched to collect data from CHs along the planned trajectory. We aim to minimize the total energy consumption of the UAV-WSN system in a complete round of data collection. Toward this end, we formulate the energy consumption minimization problem as a constrained combinatorial optimization problem by jointly selecting CHs from nodes within clusters and planning the UAV's visiting order to the selected CHs. The formulated energy consumption minimization problem is NP-hard, and hence, hard to solve optimally. In order to tackle this challenge, we propose a novel deep reinforcement learning (DRL) technique, pointer network-A* (Ptr-A*), which can efficiently learn from experiences the UAV trajectory policy for minimizing the energy consumption. The UAV's start point and the WSN with a set of pre-determined clusters are fed into the Ptr-A*, and the Ptr-A* outputs a group of CHs and the visiting order to these CHs, i.e., the UAV's trajectory. The parameters of the Ptr-A* are trained on small-scale clusters problem instances for faster training by using the actor-critic algorithm in an unsupervised manner. At inference, three search strategies are also proposed to improve the quality of solutions. Simulation results show that the trained models based on 20-clusters and 40-clusters have a good generalization ability to solve the UAV's trajectory planning problem in WSNs with different numbers of clusters, without the need to retrain the models. Furthermore, the results show that our proposed DRL algorithm outperforms two baseline techniques.

Submitted 31 July, 2021; originally announced August 2021.

Journal ref: IEEE TVT, 2021

arXiv:2106.11411 [pdf, other]

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Authors: Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren

Abstract: Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audiovisual information. To integrate information of audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating the cross-modal information fusion based on semantic correlation is sensible and successful.

Submitted 21 June, 2021; originally announced June 2021.

Comments: Accepted by INTERSPEECH 2021

arXiv:2105.12151 [pdf, other]

AutoReCon: Neural Architecture Search-based Reconstruction for Data-free Compression

Authors: Baozhou Zhu, Peter Hofstee, Johan Peltenburg, **ho Lee, Zaid Alars

Abstract: Data-free compression raises a new challenge because the original training dataset for a pre-trained model to be compressed is not available due to privacy or transmission issues. Thus, a common approach is to compute a reconstructed training dataset before compression. The current reconstruction methods compute the reconstructed training dataset with a generator by exploiting information from the pre-trained model. However, current reconstruction methods focus on extracting more information from the pre-trained model but do not leverage network engineering. This work is the first to consider network engineering as an approach to design the reconstruction method. Specifically, we propose the AutoReCon method, which is a neural architecture search-based reconstruction method. In the proposed AutoReCon method, the generator architecture is designed automatically given the pre-trained model for reconstruction. Experimental results show that using generators discovered by the AutoReCon method always improves the performance of data-free compression.

Submitted 25 May, 2021; originally announced May 2021.

arXiv:2103.03612 [pdf, other]

An Optimized H.266/VVC Software Decoder On Mobile Platform

Authors: Yiming Li, Shan Liu, Yu Chen, Yushan Zheng, Sijia Chen, Bin Zhu, Jian Lou

Abstract: As the successor of H.265/HEVC, the new versatile video coding standard (H.266/VVC) can provide up to 50% bitrate saving with the same subjective quality, at the cost of increased decoding complexity. To accelerate the application of the new coding standard, a real-time H.266/VVC software decoder that can support various platforms is implemented, where SIMD technologies, parallelism optimization, and the acceleration strategies based on the characteristics of each coding tool are applied. As the mobile devices have become an essential carrier for video services nowadays, the mentioned optimization efforts are not only implemented for the x86 platform, but more importantly utilized to highly optimize the decoding performance on the ARM platform in this work. The experimental results show that when running on the Apple A14 SoC (iPhone 12pro), the average single-thread decoding speed of the present implementation can achieve 53fps (RA and LB) for full HD (1080p) bitstreams generated by VTM-11.0 reference software using 8bit Common Test Conditions (CTC). When multi-threading is enabled, an average of 32 fps (RA) can be achieved when decoding the 4K bitstreams.

Submitted 5 March, 2021; originally announced March 2021.

arXiv:2102.08430 [pdf]

Multi-Stage Transmission Line Flow Control Using Centralized and Decentralized Reinforcement Learning Agents

Authors: Xiumin Shang, **** Yang, Bingquan Zhu, Lin Ye, **g Zhang, Jian** Xu, Qin Lyu, Ruisheng Diao

Abstract: Planning future operational scenarios of bulk power systems that meet security and economic constraints typically requires intensive labor efforts in performing massive simulations. To automate this process and relieve engineers' burden, a novel multi-stage control approach is presented in this paper to train centralized and decentralized reinforcement learning agents that can automatically adjust grid controllers for regulating transmission line flows at normal condition and under contingencies. The power grid flow control problem is formulated as a Markov Decision Process (MDP). At stage one, a centralized soft actor-critic (SAC) agent is trained to control generator active power outputs in a wide area to control transmission line flows against specified security limits. If line overloading issues remain unresolved, stage two is used to train a decentralized SAC agent via load throw-over at local substations. The effectiveness of the proposed approach is verified on a series of actual planning cases used for operating the power grid of SGCC Zhejiang Electric Power Company.

Submitted 16 February, 2021; originally announced February 2021.

Comments: This work is accepted by NeurIPS ML4Eng workshop 2020, please refer to https://ml4eng.github.io/camera_readys/56.pdf

arXiv:2012.15398 [pdf, other]

Two New Approaches to Optical IRSs: Schemes and Comparative Analysis

Authors: Haibo Wang, Zaichen Zhang, Bingcheng Zhu, Jian Dang, Liang Wu

Abstract: Oriented to the point-to-multipoint free space optical communication (FSO) scenarios, this paper analyzes the micro-mirror array and phased array-type optical intelligent reflecting surface (OIRS) in terms of control mode, power efficiency, and beam splitting. We build the physical models of the two types of OIRSs. Based on the models, the closed-form solutions of the OIRSs' output power density distribution and power efficiency, along with their control algorithms, are derived. Then we propose the algorithms of beam splitting and multi-beam power allocation for the two types of OIRSs. The channel fading in FSO systems and the comparison of the two types of OIRSs in actual systems are discussed according to the analytical results. Experiments and simulations are both presented to verify the feasibility of the models and algorithms.

Submitted 30 December, 2020; originally announced December 2020.

Comments: 26 pages, 11 figures

arXiv:2012.02832 [pdf]

A software decoder implementation for H.266/VVC video coding standard

Authors: Bin Zhu, Shan Liu, Yuan Liu, Yi Luo, **g Ye, Haiyan Xu, Ying Huang, Hualong Jiao, Xiaozhong Xu, Xianguo Zhang, Chenchen Gu

Abstract: Versatile Video Coding Standard (H.266/VVC) was completed by Joint Video Expert Team (JVET) of ITU-T and ISO/IEC, in July 2020. This new ITU recommendation/international standard is a successor to the well-known H.265/HEVC video coding standard with roughly doubled compression efficiency, but also at the cost of an increased computational complexity. The complexity of H.266/VVC decoder processing modules is studied in this paper. An optimized decoder implementation using SIMD instruction extensions and additional parallel processing including data and task level parallelism is presented, which can achieve real-time decoding of 4K 60fps VVC bitstreams on an x86 based CPU.

Submitted 7 December, 2020; v1 submitted 4 December, 2020; originally announced December 2020.

arXiv:2010.14168 [pdf, other]

Rule-embedded network for audio-visual voice activity detection in live musical video streams

Authors: Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren

Abstract: Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) primarily rely on audio; however, audio-based VAD has difficulty focusing on the target voice in noisy environments. With the help of visual information, this paper proposes a rule-embedded network to fuse the audio-visual (A-V) inputs to help the model better detect the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and use visual representations as the mask to filter out the information of non-target sound. Experiments show that: 1) with the help of cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the performance of the bi-modal model far outperforms that of audio-only models, indicating that the incorporation of both audio and visual signals is highly beneficial for VAD. To attract more attention to cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.

Submitted 31 October, 2020; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: Submitted to ICASSP 2021

arXiv:2010.14022 [pdf, other]

ByteCover: Cover Song Identification via Multi-Loss Training

Authors: Xingjian Du, Zhesong Yu, Bilei Zhu, Xiaoou Chen, Zejun Ma

Abstract: We present in this paper ByteCover, which is a new feature learning method for cover song identification (CSI). ByteCover is built based on the classical ResNet model, and two major improvements are designed to further enhance the capability of the model for CSI. In the first improvement, we introduce the integration of instance normalization (IN) and batch normalization (BN) to build IBN blocks, which are major components of our ResNet-IBN model. With the help of the IBN blocks, our CSI model can learn features that are invariant to the changes of musical attributes such as key, tempo, timbre and genre, while preserving the version information. In the second improvement, we employ the BNNeck method to allow a multi-loss training and encourage our method to jointly optimize a classification loss and a triplet loss, and by this means, the inter-class discrimination and intra-class compactness of cover songs can be ensured at the same time. A set of experiments demonstrated the effectiveness and efficiency of ByteCover on multiple datasets, and in the Da-TACOS dataset, ByteCover outperformed the best competitive system by 20.9%.

Submitted 23 April, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

arXiv:2010.13540 [pdf, other]

Contrastive Unsupervised Learning for Audio Fingerprinting

Authors: Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma

Abstract: The rise of video-sharing platforms has attracted more and more people to shoot videos and upload them to the Internet. These videos mostly contain a carefully-edited background audio track, where serious speech change, pitch shifting and various types of audio effects may be involved, and existing audio identification systems may fail to recognize the audio. To solve this problem, in this paper, we introduce the idea of contrastive learning to the task of audio fingerprinting (AFP). Contrastive learning is an unsupervised approach to learn representations that can effectively group similar samples and discriminate dissimilar ones. In our work, we consider an audio track and its differently distorted versions as similar while considering different audio tracks as dissimilar. Based on the momentum contrast (MoCo) framework, we devise a contrastive learning method for AFP, which can generate fingerprints that are both discriminative and robust. A set of experiments showed that our AFP method is effective for audio identification, with robustness to serious audio distortions, including the challenging speed change and pitch shifting.

Submitted 26 October, 2020; originally announced October 2020.

Comments: 5 pages

arXiv:2007.08165 [pdf, other]

doi 10.1109/TASLP.2020.3008832

Audio Tagging by Cross Filtering Noisy Labels

Authors: Boqing Zhu, Kele Xu, Qiuqiang Kong, Huaimin Wang, Yuxing Peng

Abstract: High quality labeled datasets have allowed deep learning to achieve impressive results on many sound analysis tasks. Yet, it is labor-intensive to accurately annotate large amounts of audio data, and the dataset may contain noisy labels in practical settings. Meanwhile, deep neural networks are susceptible to incorrectly labeled data because of their outstanding memorization ability. In this paper, we present a novel framework, named CrossFilter, to combat the noisy labels problem for audio tagging. Multiple representations (such as Logmel and MFCC) are used as the input of our framework to provide more complementary information about the audio. Then, through the cooperation and interaction of two neural networks, we divide the dataset into curated and noisy subsets by incrementally picking out the possibly correctly labeled data from the noisy data. Moreover, our approach leverages multi-task learning on the curated and noisy subsets with different loss functions to fully utilize the entire dataset. The noisy-robust loss function is employed to alleviate the adverse effects of incorrect labels. On both the audio tagging datasets FSDKaggle2018 and FSDKaggle2019, empirical results demonstrate the performance improvement compared with other competing approaches. On the FSDKaggle2018 dataset, our method achieves state-of-the-art performance and even surpasses the ensemble models.

Submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

Showing 1–50 of 67 results for author: Zhu, B