Search | arXiv e-print repository

arXiv:2406.19072 [pdf, other]

Scatterer Recognition from LiDAR Point Clouds for Environment-Embedded Vehicular Channel Modeling via Synesthesia of Machines

Authors: Ziwei Huang, Lu Bai, Zengrui Han, Xiang Cheng

Abstract: In this paper, a novel environment-embedded vehicular channel model is proposed by scatterer recognition from light detection and ranging (LiDAR) point clouds via Synesthesia of Machines (SoM). To provide a robust data foundation, a new intelligent sensing-communication integration dataset in vehicular urban scenarios is constructed. Based on the constructed dataset, the complex SoM mechanism, i.e., mapping relationship between scatterers in electromagnetic space and LiDAR point clouds in physical environment, is explored via multilayer perceptron (MLP) with electromagnetic propagation mechanism. By using LiDAR point clouds to implement scatterer recognition, channel non-stationarity and consistency are modeled in an environment-embedded manner. Using ray-tracing (RT)-based results as the ground truth, the scatterer recognition accuracy exceeds 90%. The accuracy of the proposed model is further verified by the close fit between simulation results and RT results.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.14440 [pdf, other]

LLM4CP: Adapting Large Language Models for Channel Prediction

Authors: Boxun Liu, Xuanyu Liu, Shijian Gao, Xiang Cheng, Liuqing Yang

Abstract: Channel prediction is an effective approach for reducing the feedback or estimation overhead in massive multi-input multi-output (m-MIMO) systems. However, existing channel prediction methods lack precision due to model mismatch errors or network generalization issues. Large language models (LLMs) have demonstrated powerful modeling and generalization abilities, and have been successfully applied to cross-modal tasks, including the time series analysis. Leveraging the expressive power of LLMs, we propose a pre-trained LLM-empowered channel prediction method (LLM4CP) to predict the future downlink channel state information (CSI) sequence based on the historical uplink CSI sequence. We fine-tune the network while freezing most of the parameters of the pre-trained LLM for better cross-modality knowledge transfer. To bridge the gap between the channel data and the feature space of the LLM, preprocessor, embedding, and output modules are specifically tailored by taking into account unique channel characteristics. Simulations validate that the proposed method achieves SOTA prediction performance on full-sample, few-shot, and generalization tests with low training and inference costs.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.01605 [pdf, other]

An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

Authors: Zijun Gao, Qi Wang, Taiyuan Mei, Xiaohan Cheng, Yun Zi, Haowei Yang

Abstract: The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the intricate details across various image scales more effectively, thus minimizing the information loss inherent to down-sampling procedures. Additionally, to enhance the convergence rate of network training and mitigate sample imbalance issues, we have devised a modified cross-entropy loss function incorporating a balancing factor. This modification optimizes the distribution between positive and negative samples, thus improving the efficiency of model training. Experimental evaluations of our model demonstrate a substantial reduction in information loss and improved accuracy in semantic segmentation. Notably, our proposed network architecture demonstrates a substantial improvement in the finely annotated mean Intersection over Union (mIoU) on the dataset compared to the conventional SegNet. The proposed network structure not only reduces operational costs by decreasing manual inspection needs but also scales up the deployment of AI-driven image analysis across different sectors.

Submitted 26 May, 2024; originally announced June 2024.

arXiv:2406.01205 [pdf, other]

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang, Zhou Zhao

Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. The SMSD module, which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. In terms of experiments, we make available a controllable model toolkit called ControlToolkit with a new style controllable dataset and some replicated baseline models, and propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00356 [pdf, other]

AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Authors: Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.19015 [pdf, other]

Distributed Management of Fluctuating Energy Resources in Dynamic Networked Systems

Authors: Xiaotong Cheng, Ioannis Tsetis, Setareh Maghsudi

Abstract: Modern power systems integrate renewable distributed energy resources (DERs) as an environment-friendly enhancement to meet the ever-increasing demands. However, the inherent unreliability of renewable energy renders developing DER management algorithms imperative. We study the energy-sharing problem in a system consisting of several DERs. Each agent harvests and distributes renewable energy in its neighborhood to optimize the network's performance while minimizing energy waste. We model this problem as a bandit convex optimization problem with constraints that correspond to each node's limitations for energy production. We propose distributed decision-making policies to solve the formulated problem, where we utilize the notion of dynamic regret as the performance metric. We also include an adjustment strategy in our developed algorithm to reduce the constraint violations. Besides, we design a policy that deals with the non-stationary environment. Theoretical analysis shows the effectiveness of our proposed algorithm. Numerical experiments using a real-world dataset show superior performance of our proposal compared to state-of-the-art methods.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.14347 [pdf, other]

Doubly-Dynamic ISAC Precoding for Vehicular Networks: A Constrained Deep Reinforcement Learning (CDRL) Approach

Authors: Zonghui Yang, Shijian Gao, Xiang Cheng

Abstract: Integrated sensing and communication (ISAC) technology is essential for enabling the vehicular networks. However, the communication channel in this scenario exhibits time-varying characteristics, and the potential targets may move rapidly, creating a doubly-dynamic phenomenon. This nature poses a challenge for real-time precoder design. While optimization-based solutions are widely researched, they are complex and heavily rely on perfect prior information, which is impractical in double dynamics. To address this challenge, we propose using constrained deep reinforcement learning (CDRL) to facilitate dynamic updates to the ISAC precoder design. Additionally, the primal dual-deep deterministic policy gradient (PD-DDPG) and Wolpertinger architecture are tailored to efficiently train the algorithm under complex constraints and variable numbers of users. The proposed scheme not only adapts to the dynamics based on observations but also leverages environmental information to enhance performance and reduce complexity. Its superiority over existing candidates has been validated through experiments.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.09778 [pdf, other]

Beam Pattern Modulation Embedded Hybrid Transceiver Optimization for Integrated Sensing and Communication

Authors: Boxun Liu, Shijian Gao, Zonghui Yang, Xiang Cheng, Liuqing Yang

Abstract: Integrated sensing and communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widely utilized hybrid architecture in mmWave systems compromises multiplexing gain due to the constraints of limited radio frequency chains. Moreover, additional sensing functionalities exacerbate the impairment of spectrum efficiency (SE). In this paper, we present an optimized beam pattern modulation-embedded ISAC (BPM-ISAC) transceiver design, which spares one RF chain for sensing and the others for communication. To compensate for the reduced SE, index modulation across communication beams is applied. We formulate an optimization problem aimed at minimizing the mean squared error (MSE) of the sensing beampattern, subject to a symbol MSE constraint. This problem is then solved by sequentially optimizing the analog and digital parts. Both the multi-aperture structure (MAS) and the multi-beam structure (MBS) are considered for the design of the analog part. We conduct theoretical analysis on the asymptotic pairwise error probability (APEP) and the Cramér-Rao bound (CRB) of direction of arrival (DoA) estimation. Numerical simulations validate the overall enhanced ISAC performance over existing alternatives.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.08306 [pdf, other]

Flight Path Optimization with Optimal Control Method

Authors: Gaofeng Su, Xi Cheng, Siyuan Feng, Ke Liu, Jilin Song, Jianan Chen, Chen Zhu, Hui Lin

Abstract: This paper is based on a crucial issue in the aviation world: how to optimize the trajectory and controls given to the aircraft in order to optimize flight time and fuel consumption. This study aims to provide elements of a response to this problem and to define, under certain simplifying assumptions, an optimal response, using Constrained Finite Time Optimal Control (CFTOC). The first step is to define the dynamic model of the aircraft in accordance with the controllable inputs and wind disturbances. Then we will identify a precise objective in terms of optimization and implement an optimization program to solve it under the circumstances of simulated real flight situation. Finally, the optimization result is validated and discussed by different scenarios.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.16825 [pdf, other]

ResVR: Joint Rescaling and Viewport Rendering of Omnidirectional Images

Authors: Weiqi Li, Shijie Zhao, Bin Chen, Xinhua Cheng, Junlin Li, Li Zhang, Jian Zhang

Abstract: With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced for reducing transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, which overlooks the fact that the content viewed on head mounted displays (HMDs) is actually a rendered viewport instead of an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, which is the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR allows obtaining LR ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In our ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and ERP, enabling end-to-end training of ResVR pipeline. Furthermore, a spherical pixel shape representation technique is innovatively derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that our ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead.

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.09313 [pdf, other]

Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment

Authors: Zhiqing Hong, Rongjie Huang, Xize Cheng, Yongqi Wang, Ruiqi Li, Fuming You, Zhou Zhao, Zhimeng Zhang

Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention has been paid to exploring song synthesis. In this work, we propose a novel task called text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built up to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.

Submitted 20 May, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

Comments: ACL 2024 Main

arXiv:2403.14185 [pdf, other]

A LiDAR-Aided Channel Model for Vehicular Intelligent Sensing-Communication Integration

Authors: Ziwei Huang, Lu Bai, Mingran Sun, Xiang Cheng

Abstract: In this paper, a novel channel modeling approach, named light detection and ranging (LiDAR)-aided geometry-based stochastic modeling (LA-GBSM), is developed. Based on the developed LA-GBSM approach, a new millimeter wave (mmWave) channel model for sixth-generation (6G) vehicular intelligent sensing-communication integration is proposed, which can support the design of intelligent transportation systems (ITSs). The proposed LA-GBSM is accurately parameterized under high, medium, and low vehicular traffic density (VTD) conditions via a sensing-communication simulation dataset with LiDAR point clouds and scatterer information for the first time. Specifically, by detecting dynamic vehicles and static buildings/trees through LiDAR point clouds via machine learning, scatterers are divided into static and dynamic scatterers. Furthermore, statistical distributions of parameters, e.g., distance, angle, number, and power, related to static and dynamic scatterers are quantified under high, medium, and low VTD conditions. To mimic channel non-stationarity and consistency, based on the quantified statistical distributions, a new visibility region (VR)-based algorithm in consideration of newly generated static/dynamic scatterers is developed. Key channel statistics are derived and simulated. By comparing simulation results and ray-tracing (RT)-based results, the utility of the proposed LA-GBSM is verified.

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.13314 [pdf, other]

Superposed IM-OFDM (S-IM-OFDM): An Enhanced OFDM for Integrated Sensing and Communications

Authors: Zonghui Yang, Shijian Gao, Xiang Cheng, Liuqing Yang

Abstract: Integrated sensing and communications (ISAC) is a critical enabler for emerging 6G applications, and at its core lies the dual-functional waveform design. While orthogonal frequency division multiplexing (OFDM) has been a popular basic waveform, its primitive version falls short in sensing due to the inherent unregulated auto-correlation properties. Furthermore, the sensitivity to Doppler shift hinders its broader applications in dynamic scenarios. To address these issues, we propose a superposed index-modulated OFDM (S-IM-OFDM). The proposed scheme improves the sensing performance without excess power consumption by translating the energy efficiency of IM-OFDM onto sensing-oriented signals over OFDM. Also, it maintains excellent communication performance in time-varying channels by leveraging the sensed parameters to compensate for Doppler. Compared to conventional OFDM, the proposed S-IM-OFDM waveform exhibits better sensing capabilities and wider applicability in dynamic scenarios. Both theoretical analyses and simulations corroborate its dual benefits.

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.10629 [pdf, other]

Virtual Elastic Tether: a New Approach for Multi-agent Navigation in Confined Aquatic Environments

Authors: Kanzhong Yao, Xueliang Cheng, Keir Groves, Barry Lennox, Ognjen Marjanovic, Simon Watson

Abstract: Underwater navigation is a challenging area in the field of mobile robotics due to inherent constraints in self-localisation and communication in underwater environments. Some of these challenges can be mitigated by using collaborative multi-agent teams. However, when applied underwater, the robustness of traditional multi-agent collaborative control approaches is highly limited due to the unavailability of reliable measurements. In this paper, the concept of a Virtual Elastic Tether (VET) is introduced in the context of incomplete state measurements, which represents an innovative approach to underwater navigation in confined spaces. The concept of VET is formulated and validated using the Cooperative Aquatic Vehicle Exploration System (CAVES), which is a sim-to-real multi-agent aquatic robotic platform. Within this framework, a vision-based Autonomous Underwater Vehicle-Autonomous Surface Vehicle leader-follower formulation is developed. Experiments were conducted both in simulation and on a physical platform, benchmarked against a traditional Image-Based Visual Servoing approach. Results indicate that the formation of the baseline approach fails under discrete disturbances, when induced distances between the robots exceed 0.6 m in simulation and 0.3 m in the real world. In contrast, the VET-enhanced system recovers to pre-perturbation distances within 5 seconds. Furthermore, results illustrate the successful navigation of VET-enhanced CAVES in a confined water pond where the baseline approach fails to perform adequately.

Submitted 15 March, 2024; originally announced March 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2403.10417 [pdf, other]

Beam Pattern Modulation Embedded mmWave Hybrid Transceiver Design Towards ISAC

Authors: Boxun Liu, Shijian Gao, Zonghui Yang, Xiang Cheng

Abstract: Integrated Sensing and Communication (ISAC) emerges as a promising technology for B5G/6G, particularly in the millimeter-wave (mmWave) band. However, the widespread adoption of hybrid architecture in mmWave systems compromises multiplexing gain due to limited radio-frequency chains, resulting in mediocre performance when embedding sensing functionality. To avoid sacrificing the spectrum efficiency in hybrid structures while addressing performance bottlenecks in its extension to ISAC, we present an optimized beam pattern modulation-embedded ISAC (BPM-ISAC). BPM-ISAC applies index modulation over beamspace by selectively activating communication beams, aiming to minimize sensing beampattern mean squared error (MSE) under communication MSE constraints through dedicated hybrid transceiver design. Optimization involves the analog part through a min-MSE-based beam selection algorithm, followed by the digital part using an alternating optimization algorithm. Convergence and asymptotic pairwise error probability (APEP) analyses accompany numerical simulations, validating its overall enhanced ISAC performance over existing alternatives.

Submitted 15 March, 2024; originally announced March 2024.

arXiv:2403.07444 [pdf, other]

A Survey on Federated Learning in Intelligent Transportation Systems

Authors: Rongqing Zhang, Hanqiu Wang, Bing Li, Xiang Cheng, Liuqing Yang

Abstract: The development of Intelligent Transportation System (ITS) has brought about comprehensive urban traffic information that not only provides convenience to urban residents in their daily lives but also enhances the efficiency of urban road usage, leading to a more harmonious and sustainable urban life. Typical scenarios in ITS mainly include traffic flow prediction, traffic target recognition, and vehicular edge computing. However, most current ITS applications rely on a centralized training approach where users upload source data to a cloud server with high computing power for management and centralized training. This approach has limitations such as poor real-time performance, data silos, and difficulty in guaranteeing data privacy. To address these limitations, federated learning (FL) has been proposed as a promising solution. In this paper, we present a comprehensive review of the application of FL in ITS, with a particular focus on three key scenarios: traffic flow prediction, traffic target recognition, and vehicular edge computing. For each scenario, we provide an in-depth analysis of its key characteristics, current challenges, and specific manners in which FL is leveraged. Moreover, we discuss the benefits that FL can offer as a potential solution to the limitations of the centralized training approach currently used in ITS applications.

Submitted 14 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

arXiv:2402.09434 [pdf, other]

Disentangling Imperfect: A Wavelet-Infused Multilevel Heterogeneous Network for Human Activity Recognition in Flawed Wearable Sensor Data

Authors: Mengna Liu, Dong Xiang, Xu Cheng, Xiufeng Liu, Dalin Zhang, Shengyong Chen, Christian S. Jensen

Abstract: The popularity and diffusion of wearable devices provide new opportunities for sensor-based human activity recognition that leverages deep learning-based algorithms. Although impressive advances have been made, two major challenges remain. First, sensor data is often incomplete or noisy due to sensor placement and other issues as well as data transmission failure, calling for imputation of missing values, which also introduces noise. Second, human activity has multi-scale characteristics. Thus, different groups of people and even the same person may behave differently under different circumstances. To address these challenges, we propose a multilevel heterogeneous neural network, called MHNN, for sensor data analysis. We utilize multilevel discrete wavelet decomposition to extract multi-resolution features from sensor data. This enables distinguishing signals with different frequencies, thereby suppressing noise. As the components resulting from the decomposition are heterogeneous, we equip the proposed model with heterogeneous feature extractors that enable the learning of multi-scale features. Due to the complementarity of these features, we also include a cross aggregation module for enhancing their interactions. An experimental study using seven publicly available datasets offers evidence that MHNN can outperform other cutting-edge models and is robust to missing values and noise. An ablation study confirms the importance of each module.

Submitted 26 January, 2024; originally announced February 2024.

Comments: 14 pages, 7 figures

arXiv:2402.03585 [pdf, other]

Decoder-Only Image Registration

Authors: Xi Jia, Wenqi Lu, Xinxing Cheng, Jinming Duan

Abstract: In unsupervised medical image registration, the predominant approaches involve the utilization of an encoder-decoder network architecture, allowing for precise prediction of dense, full-resolution displacement fields from given paired images. Despite its widespread use in the literature, we question the necessity of making both the encoder and decoder learnable in such an architecture. For this, we propose a novel network architecture, termed LessNet in this paper, which contains only a learnable decoder, while entirely omitting the utilization of a learnable encoder. LessNet substitutes the learnable encoder with simple, handcrafted features, eliminating the need to learn (optimize) network parameters in the encoder altogether. Consequently, this leads to a compact, efficient, and decoder-only architecture for 3D medical image registration. Evaluated on two publicly available brain MRI datasets, we demonstrate that our decoder-only LessNet can effectively and efficiently learn both dense displacement and diffeomorphic deformation fields in 3D. Furthermore, our decoder-only LessNet can achieve comparable registration performance to state-of-the-art methods such as VoxelMorph and TransMorph, while requiring significantly fewer computational resources. Our code and pre-trained models are available at https://github.com/xi-jia/LessNet.

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.16712 [pdf, other]

LF Tracy: A Unified Single-Pipeline Approach for Salient Object Detection in Light Field Cameras

Authors: Fei Teng, Jiaming Zhang, Jiawei Liu, Kunyu Peng, Xina Cheng, Zhiyong Li, Kailun Yang

Abstract: Leveraging the rich information extracted from light field (LF) cameras is instrumental for dense prediction tasks. However, adapting light field data to enhance Salient Object Detection (SOD) still follows the traditional RGB methods and remains under-explored in the community. Previous approaches predominantly employ a custom two-stream design to discover the implicit angular feature within light field cameras, leading to significant information isolation between different LF representations. In this study, we propose an efficient paradigm (LF Tracy) to address this limitation. We eschew the conventional specialized fusion and decoder architecture for a dual-stream backbone in favor of a unified, single-pipeline approach. This comprises firstly a simple yet effective data augmentation strategy called MixLD to bridge the connection of spatial, depth, and implicit angular information under different LF representations. A highly efficient information aggregation (IA) module is then introduced to boost asymmetric feature-wise information fusion. Owing to this innovative approach, our model surpasses the existing state-of-the-art methods, particularly demonstrating a 23% improvement over previous results on the latest large-scale PKU dataset. By utilizing only 28.9M parameters, the model achieves a 10% increase in accuracy with 3M additional parameters compared to its backbone using RGB images and an 86% rise to its backbone using LF images. The source code will be made publicly available at https://github.com/FeiBryantkit/LF-Tracy.

Submitted 29 January, 2024; originally announced January 2024.

Comments: The source code will be made publicly available at https://github.com/FeiBryantkit/LF-Tracy

arXiv:2401.16700 [pdf, other]

Towards Precise 3D Human Pose Estimation with Multi-Perspective Spatial-Temporal Relational Transformers

Authors: Jianbin Jiao, Xina Cheng, Weijie Chen, Xiaoting Yin, Hao Shi, Kailun Yang

Abstract: 3D human pose estimation captures the human joint points in three-dimensional space while keeping the depth information and physical structure. That is essential for applications that require precise pose information, such as human-computer interaction, scene understanding, and rehabilitation training. Due to the challenges in data collection, mainstream datasets of 3D human pose estimation are primarily composed of multi-view video data collected in laboratory environments, which contains rich spatial-temporal correlation information besides the image frame content. Given the remarkable self-attention mechanism of transformers, capable of capturing the spatial-temporal correlation from multi-view video datasets, we propose a multi-stage framework for 3D sequence-to-sequence (seq2seq) human pose detection. Firstly, the spatial module represents the human pose feature by intra-image content, while the frame-image relation module extracts temporal relationships and 3D spatial positional relationship features between the multi-perspective images. Secondly, the self-attention mechanism is adopted to eliminate the interference from non-human body parts and reduce computing resources. Our method is evaluated on Human3.6M, a popular 3D human pose detection dataset. Experimental results demonstrate that our approach achieves state-of-the-art performance on this dataset. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose.

Submitted 25 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted to IJCNN 2024. The source code will be available at https://github.com/WUJINHUAN/3D-human-pose

arXiv:2312.15197 [pdf, other]

TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation

Authors: Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Changpeng Yang, Zhou Zhao

Abstract: Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model to convert audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, to re-synthesize synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, ensuring isometric talking head translation and preventing duplicate reference frames.
Experiments demonstrate that our proposed Unit2Lip model significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and 100% isochronous translations.

Submitted 23 December, 2023; originally announced December 2023.

arXiv:2312.01544 [pdf, other]

KEEC: Embed to Control on An Equivariant Geometry

Authors: Xiaoyuan Cheng, Yiming Yang, Wei Jiang, Yukun Hu

Abstract: This paper investigates how representation learning can enable optimal control in unknown and complex dynamics, such as chaotic and non-linear systems, without relying on prior domain knowledge of the dynamics. The core idea is to establish an equivariant geometry that is diffeomorphic to the manifold defined by a dynamical system and to perform optimal control within this corresponding geometry, which is a non-trivial task. To address this challenge, Koopman Embed to Equivariant Control (KEEC) is proposed for model learning and control. Inspired by Lie theory, KEEC begins by learning a non-linear dynamical system defined on a manifold and embedding trajectories into a Lie group. Subsequently, KEEC formulates an equivariant value function equation in reinforcement learning on the equivariant geometry, ensuring an invariant effect as the value function on the original manifold. By deriving analytical-form optimal actions on the equivariant value function, KEEC theoretically achieves quadratic convergence for the optimal equivariant value function by leveraging the differential information on the equivariant geometry. The effectiveness of KEEC is demonstrated in challenging dynamical systems, including chaotic ones like Lorenz-63. Notably, our results show that isometric functions, which maintain the compactness and completeness of geometry while preserving metric and differential information, consistently outperform loss functions lacking these characteristics.

Submitted 10 December, 2023; v1 submitted 3 December, 2023; originally announced December 2023.

arXiv:2312.00727 [pdf, other]

Safe Reinforcement Learning in Tensor Reproducing Kernel Hilbert Space

Authors: Xiaoyuan Cheng, Boli Chen, Liz Varga, Yukun Hu

Abstract: This paper delves into the problem of safe reinforcement learning (RL) in a partially observable environment with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief in latent states. However, accurately estimating an optimal Bayesian filter in POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely in the face of unknown system dynamics and partial observation environments. We leveraged the Predictive State Representation (PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this context are provable. Furthermore, we derived essential operators from the kernel Bayes' rule, enabling the recursive estimation of future observations using various operators. Under the assumption of undercompleteness, a polynomial sample complexity is established for the RL algorithm for the infinite size of observation and action spaces, ensuring an $\epsilon$-suboptimal safe policy guarantee.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00550 [pdf, ps, other]

Novel 3D Geometry-Based Stochastic Models for Non-Isotropic MIMO Vehicle-to-Vehicle Channels

Authors: Yi Yuan, Cheng-Xiang Wang, Xiang Cheng, Bo Ai, David I. Laurenson

Abstract: This paper proposes a novel three-dimensional (3D) theoretical regular-shaped geometry-based stochastic model (RS-GBSM) and the corresponding sum-of-sinusoids (SoS) simulation model for non-isotropic multiple-input multiple-output (MIMO) vehicle-to-vehicle (V2V) Ricean fading channels. The proposed RS-GBSM, combining line-of-sight (LoS) components, a two-sphere model, and an elliptic-cylinder model, has the ability to study the impact of the vehicular traffic density (VTD) on channel statistics, and jointly considers the azimuth and elevation angles by using the von Mises Fisher distribution. Moreover, a novel parameter computation method is proposed for jointly calculating the azimuth and elevation angles in the SoS channel simulator. Based on the proposed 3D theoretical RS-GBSM and its SoS simulation model, statistical properties are derived and thoroughly investigated. The impact of the elevation angle in the 3D model on key statistical properties is investigated by comparing with those of the corresponding two-dimensional (2D) model. It is demonstrated that the 3D model is more accurate to characterize real V2V channels, in particular for pico cell scenarios. Finally, close agreement is achieved between the theoretical model, SoS simulation model, and simulation results, demonstrating the utility of the proposed models.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.14264 [pdf, ps, other]

An ADMM-Based Geometric Configuration Optimization in RSSD-Based Source Localization By UAVs with Spread Angle Constraint

Authors: Xin Cheng, Weiqiang Zhu, Feng Shu, Jiangzhou Wang

Abstract: Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for received signal strength difference (RSSD)-based passive source localization by drone swarm. Different from prior works, this paper considers a general measuring condition where the spread angle of drone swarm centered on the source is constrained. Subject to this constraint, a geometric configuration optimization problem with the aim of maximizing the determinant of Fisher information matrix (FIM) is formulated. After transforming this problem using matrix theory, an alternating direction method of multipliers (ADMM)-based optimization framework is proposed. To solve the subproblems in this framework, two global optimal solutions based on the Von Neumann matrix trace inequality theorem and majorize-minimize (MM) algorithm are proposed respectively. Finally, the effectiveness as well as the practicality of the proposed ADMM-based optimization algorithm are demonstrated by extensive simulations.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2310.02561 [pdf, other]

Integrated Sensing and Communications Towards Proactive Beamforming in mmWave V2I via Multi-Modal Feature Fusion (MMFF)

Authors: Haotian Zhang, Shijian Gao, Xiang Cheng, Liuqing Yang

Abstract: The future of vehicular communication networks relies on mmWave massive multi-input-multi-output antenna arrays for intensive data transfer and massive vehicle access. However, reliable vehicle-to-infrastructure links require exact alignment between the narrow beams, which traditionally involves excessive signaling overhead. To address this issue, we propose a novel proactive beamforming scheme that integrates multi-modal sensing and communications via Multi-Modal Feature Fusion Network (MMFF-Net), which is composed of multiple neural network components with distinct functions. Unlike existing methods that rely solely on communication processing, our approach obtains comprehensive environmental features to improve beam alignment accuracy. We verify our scheme on the Vision-Wireless (ViWi) dataset, which we enriched with realistic vehicle drifting behavior. Our proposed MMFF-Net achieves more accurate and stable angle prediction, which in turn increases the achievable rates and reduces the communication system outage probability. Even in complex dynamic scenarios with adverse environment conditions, robust prediction results can be guaranteed, demonstrating the feasibility and practicality of the proposed proactive beamforming approach.

Submitted 26 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: 14 pages, 12 figures, 5 tables

arXiv:2309.14341 [pdf, other]

Extreme Parkour with Legged Robots

Authors: Xuxin Cheng, Kexin Shi, Ananye Agarwal, Deepak Pathak

Abstract: Humans can perform parkour by traversing obstacles in a highly dynamic fashion requiring precise eye-muscle coordination and movement. Getting robots to do the same task requires overcoming similar challenges. Classically, this is done by independently engineering perception, actuation, and control systems to very low tolerances. This restricts them to tightly controlled settings such as a predetermined obstacle course in labs. In contrast, humans are able to learn parkour through practice without significantly changing their underlying biology. In this paper, we take a similar approach to developing robot parkour on a small low-cost robot with imprecise actuation and a single front-facing depth camera for perception which is low-frequency, jittery, and prone to artifacts. We show how a single neural net policy operating directly from a camera image, trained in simulation with large-scale RL, can overcome imprecise sensing and actuation to output highly precise control behavior end-to-end. We show our robot can perform a high jump on obstacles 2x its height, long jump across gaps 2x its length, do a handstand and run across tilted ramps, and generalize to novel obstacle courses with different physical properties. Parkour videos at https://extreme-parkour.github.io/

Submitted 25 September, 2023; originally announced September 2023.

Comments: Website and videos at https://extreme-parkour.github.io/

arXiv:2308.13381 [pdf, ps, other]

Deep Unfolding-Based Channel Estimation for Wideband TeraHertz Near-Field Massive MIMO Systems

Authors: Jiabao Gao, Xiaoming Cheng, Geoffrey Ye Li

Abstract: The combination of Terahertz (THz) and massive multiple-input multiple-output (MIMO) is promising to meet the increasing data rate demand of future wireless communication systems thanks to the huge bandwidth and spatial degrees of freedom. However, unique channel features such as the near-field beam split effect make channel estimation particularly challenging in THz massive MIMO systems. On one hand, adopting the conventional angular domain transformation dictionary designed for low-frequency far-field channels will result in degraded channel sparsity and destroyed sparsity structure in the transformed domain. On the other hand, most existing compressive sensing-based channel estimation algorithms cannot achieve high performance and low complexity simultaneously. To alleviate these issues, in this paper, we first adopt frequency-dependent near-field dictionaries to maintain good channel sparsity and sparsity structure in the transformed domain under the near-field beam split effect. Then, a deep unfolding-based wideband THz massive MIMO channel estimation algorithm is proposed. In each iteration of the unitary approximate message passing-sparse Bayesian learning algorithm, the optimal update rule is learned by a deep neural network (DNN), whose structure is customized to effectively exploit the inherent channel patterns. Furthermore, a mixed training method based on novel designs of the DNN structure and the loss function is developed to effectively train data from different system configurations.
Simulation results validate the superiority of the proposed algorithm in terms of performance, complexity, and robustness.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2307.15374 [pdf]

Leveraging Optical Communication Fiber and AI for Distributed Water Pipe Leak Detection

Authors: Huan Wu, Huan-Feng Duan, Wallace W. L. Lai, Kun Zhu, Xin Cheng, Hao Yin, Bin Zhou, Chun-Cheung Lai, Chao Lu, Xiaoli Ding

Abstract: Detecting leaks in water networks is a costly challenge. This article introduces a practical solution: the integration of optical network with water networks for efficient leak detection. Our approach uses a fiber-optic cable to measure vibrations, enabling accurate leak identification and localization by an intelligent algorithm. We also propose a method to assess leak severity for prioritized repairs. Our solution detects even small leaks with flow rates as low as 0.027 L/s. It offers a cost-effective way to improve leak detection, enhance water management, and increase operational efficiency.

Submitted 28 July, 2023; originally announced July 2023.

Comments: Accepted

Journal ref: IEEE Communications Magazine, 2023

arXiv:2307.05362 [pdf, other]

SleepEGAN: A GAN-enhanced Ensemble Deep Learning Model for Imbalanced Classification of Sleep Stages

Authors: Xuewei Cheng, Ke Huang, Yi Zou, Shujie Ma

Abstract: Deep neural networks have played an important role in automatic sleep stage classification because of their strong representation and in-model feature transformation abilities. However, class imbalance and individual heterogeneity which typically exist in raw EEG signals of sleep data can significantly affect the classification performance of any machine learning algorithms. To solve these two problems, this paper develops a generative adversarial network (GAN)-powered ensemble deep learning model, named SleepEGAN, for the imbalanced classification of sleep stages. To alleviate class imbalance, we propose a new GAN (called EGAN) architecture adapted to the features of EEG signals for data augmentation. The generated samples for the minority classes are used in the training process. In addition, we design a cost-free ensemble learning strategy to reduce the model estimation variance caused by the heterogeneity between the validation and test sets, so as to enhance the accuracy and robustness of prediction performance. We show that the proposed method can improve classification accuracy compared to several existing state-of-the-art methods using three public sleep datasets.

Submitted 3 July, 2023; originally announced July 2023.

Comments: 20 pages, 6 figures

arXiv:2307.00583 [pdf, other]

A region and category confidence-based multi-task network for carotid ultrasound image segmentation and classification

Authors: Haitao Gan, Ran Zhou, Yanghan Ou, Furong Wang, Xinyao Cheng, Aaron Fenster

Abstract: The segmentation and classification of carotid plaques in ultrasound images play important roles in the treatment of atherosclerosis and assessment for the risk of stroke. Although deep learning methods have been used for carotid plaque segmentation and classification, two-stage methods increase the complexity of the overall analysis, and the existing multi-task methods ignore the relationship between the segmentation and classification. These will lead to suboptimal performance as valuable information might not be fully leveraged across all tasks. Therefore, we propose a multi-task learning framework (RCCM-Net) for ultrasound carotid plaque segmentation and classification, which utilizes a region confidence module (RCM) and a sample category confidence module (CCM) to exploit the correlation between these two tasks. The RCM provides knowledge from the probability of plaque regions to the classification task, while the CCM is designed to learn the categorical sample weight for the segmentation task. A total of 1270 2D ultrasound images of carotid plaques were collected from Zhongnan Hospital (Wuhan, China) for our experiments. The results showed that the proposed method can improve both segmentation and classification performance compared to existing single-task networks (i.e., SegNet, Deeplabv3+, UNet++, EfficientNet, Res2Net, RepVGG, DPN) and multi-task algorithms (i.e., HRNet, MTANet), with an accuracy of 85.82% for classification and a Dice-similarity-coefficient of 84.92% for segmentation.
In the ablation study, the results demonstrated that both the designed RCM and CCM were beneficial in improving the network's performance. Therefore, we believe that the proposed method could be useful for carotid plaque analysis in clinical trials and practice.

Submitted 18 November, 2023; v1 submitted 2 July, 2023; originally announced July 2023.

arXiv:2306.14143 [pdf, other]

Intelligent Multi-Modal Sensing-Communication Integration: Synesthesia of Machines

Authors: Xiang Cheng, Haotian Zhang, Jianan Zhang, Shijian Gao, Sijiang Li, Ziwei Huang, Lu Bai, Zonghui Yang, Xinhu Zheng, Liuqing Yang

Abstract: In the era of sixth-generation (6G) wireless communications, integrated sensing and communications (ISAC) is recognized as a promising solution to upgrade the physical system by endowing wireless communications with sensing capability. Existing ISAC is mainly oriented to static scenarios with radio-frequency (RF) sensors being the primary participants, thus lacking a comprehensive environment feature characterization and facing a severe performance bottleneck in dynamic environments. To date, extensive surveys on ISAC have been conducted but are limited to summarizing RF-based radar sensing. Currently, some research efforts have been devoted to exploring multi-modal sensing-communication integration but still lack a comprehensive review. Therefore, we generalize the concept of ISAC inspired by human synesthesia to establish a unified framework of intelligent multi-modal sensing-communication integration and provide a comprehensive review under such a framework in this paper. The so-termed Synesthesia of Machines (SoM) gives the clearest cognition of such intelligent integration and details its paradigm for the first time. We commence by justifying the necessity of the new paradigm. Subsequently, we offer a definition of SoM and zoom into the detailed paradigm, which is summarized as three operation modes. To facilitate SoM research, we overview the prerequisite of SoM research, i.e., mixed multi-modal (MMM) datasets. Then, we introduce the mapping relationships between multi-modal sensing and communications.
Afterward, we cover the technological review on SoM-enhance-based and SoM-concert-based applications. To corroborate the superiority of SoM, we also present simulation results related to dual-function waveform and predictive beamforming design. Finally, we propose some potential directions to inspire future research efforts.

Submitted 20 November, 2023; v1 submitted 25 June, 2023; originally announced June 2023.

Comments: This paper has been accepted by IEEE Communications Surveys & Tutorials

arXiv:2306.14125 [pdf, other]

M$^3$SC: A Generic Dataset for Mixed Multi-Modal (MMM) Sensing and Communication Integration

Authors: Xiang Cheng, Ziwei Huang, Lu Bai, Haotian Zhang, Mingran Sun, Boxun Liu, Sijiang Li, Jianan Zhang, Minson Lee

Abstract: The sixth generation (6G) of mobile communication system is witnessing a new paradigm shift, i.e., integrated sensing-communication system. A comprehensive dataset is a prerequisite for 6G integrated sensing-communication research. This paper develops a novel simulation dataset, named M3SC, for mixed multi-modal (MMM) sensing-communication integration, and the generation framework of the M3SC dataset is further given. To obtain multi-modal sensory data in physical space and communication data in electromagnetic space, we utilize AirSim and WaveFarer to collect multi-modal sensory data and exploit Wireless InSite to collect communication data. Furthermore, the in-depth integration and precise alignment of AirSim, WaveFarer, and Wireless InSite are achieved. The M3SC dataset covers various weather conditions, various frequency bands, and different times of the day. Currently, the M3SC dataset contains 1500 snapshots, including 80 RGB images, 160 depth maps, 80 LiDAR point clouds, 256 sets of mmWave waveforms with 8 radar point clouds, and 72 channel impulse response (CIR) matrices per snapshot, thus totaling 120,000 RGB images, 240,000 depth maps, 120,000 LiDAR point clouds, 384,000 sets of mmWave waveforms with 12,000 radar point clouds, and 108,000 CIR matrices. The data processing result presents the multi-modal sensory information and communication channel statistical properties. Finally, the MMM sensing-communication application, which can be supported by the M3SC dataset, is discussed.

Submitted 25 June, 2023; originally announced June 2023.

Comments: 12 pages, 12 figures

arXiv:2306.12042 [pdf, ps, other]

Block-Wise Index Modulation and Receiver Design for High-Mobility OTFS Communications

Authors: Mi Qian, Fei Ji, Yao Ge, Miaowen Wen, Xiang Cheng, H. Vincent Poor

Abstract: As a promising technique for high-mobility wireless communications, orthogonal time frequency space (OTFS) has been proved to enjoy excellent advantages with respect to traditional orthogonal frequency division multiplexing (OFDM). Although multiple studies have considered index modulation (IM) based OTFS (IM-OTFS) schemes to further improve system performance, a challenging and open problem is the development of effective IM schemes and efficient receivers for practical OTFS systems that must operate in the presence of channel delays and Doppler shifts. In this paper, we propose two novel block-wise IM schemes for OTFS systems, named delay-IM with OTFS (DeIM-OTFS) and Doppler-IM with OTFS (DoIM-OTFS), where a block of delay/Doppler resource bins are activated simultaneously. Based on a maximum likelihood (ML) detector, we analyze upper bounds on the average bit error rates for the proposed DeIM-OTFS and DoIM-OTFS schemes, and verify their performance advantages over the existing IM-OTFS systems. We also develop a multi-layer joint symbol and activation pattern detection (MLJSAPD) algorithm and a customized message passing detection (CMPD) algorithm for our proposed DeIM-OTFS and DoIM-OTFS systems with low complexity. Simulation results demonstrate that our proposed MLJSAPD and CMPD algorithms can achieve desired performance with robustness to the imperfect channel state information (CSI).

Submitted 21 June, 2023; originally announced June 2023.

Comments: arXiv admin note: text overlap with arXiv:2210.13454

arXiv:2306.02982 [pdf, other]

PolyVoice: Language Models for Speech to Speech Translation

Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

Abstract: We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.

Submitted 13 June, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.15403 [pdf, other]

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Authors: Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin, Zhou Zhao

Abstract: Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io

Submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted to ACL 2023

arXiv:2305.14381 [pdf, other]

Connecting Multi-modal Contrastive Representations

Authors: Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao

Abstract: Multi-modal Contrastive Representation learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive high-quality data pairs limits its further development on more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them to a new space and use the data from the overlapping modality B to align the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned by overlapping modality can also be transferred to non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completion of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. To demonstrate the effectiveness of C-MCR, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40.

Submitted 18 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023

arXiv:2305.12552 [pdf, other]

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

Authors: Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, Zhou Zhao

Abstract: Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases, which has been traditionally implemented in a cascaded manner while facing the following challenges: 1) model training is faced with the major issue of data scarcity, where limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data. In this work, we propose the first direct speech-to-SQL parsing model Wav2SQL which avoids error compounding across cascaded systems. Specifically, 1) to accelerate speech-driven SQL parsing research in the community, we release a large-scale and multi-speaker dataset MASpider; 2) leveraging the recent progress in the large-scale pre-training, we show that it alleviates the data scarcity issue and allows for direct speech-to-SQL parsing; and 3) we include the speech re-programming and gradient reversal classifier techniques to reduce acoustic variance and learn style-agnostic representations, improving generalization to unseen out-of-domain custom data. Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results by up to 2.5% accuracy improvement over the baseline.

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2303.11646 [pdf, other]

doi 10.1109/TCST.2024.3387588

Vehicle Sequencing at Signal-Free Intersections: Analytical Performance Guarantees Based on PDMP Formulation

Authors: Xiangchen Cheng, Wei Tang, Ming Yang, Li Jin

Abstract: Signal-free intersections are a representative application of smart and connected vehicle technologies. Although extensive results have been developed for trajectory planning and autonomous driving, the formulation and evaluation of vehicle sequencing have not been well understood. In this paper, we consider theoretical guarantees of macroscopic performance (i.e., capacity and delay) of typical sequencing policies at signal-free intersections. We model intersection traffic as a piecewise-deterministic Markov process (PDMP). We analytically characterize the intersection capacity regions and provide upper bounds on travel delay under three typical policies, viz. first-in-first-out, min-switchover, and longer-queue-first. We obtain these results by constructing policy-specific Lyapunov functions and computing mean drift of the PDMP. We also validate the results via a series of micro-simulation-based experiments.

Submitted 21 March, 2023; originally announced March 2023.

arXiv:2303.11330 [pdf, other]

Legs as Manipulator: Pushing Quadrupedal Agility Beyond Locomotion

Authors: Xuxin Cheng, Ashish Kumar, Deepak Pathak

Abstract: Locomotion has seen dramatic progress for walking or running across challenging terrains. However, robotic quadrupeds are still far behind their biological counterparts, such as dogs, which display a variety of agile skills and can use the legs beyond locomotion to perform several basic manipulation tasks like interacting with objects and climbing. In this paper, we take a step towards bridging this gap by training quadruped robots not only to walk but also to use the front legs to climb walls, press buttons, and perform object interaction in the real world. To handle this challenging optimization, we decouple the skill learning broadly into locomotion, which involves anything that involves movement whether via walking or climbing a wall, and manipulation, which involves using one leg to interact while balancing on the other three legs. These skills are trained in simulation using curriculum and transferred to the real world using our proposed sim2real variant that builds upon recent locomotion success. Finally, we combine these skills into a robust long-term plan by learning a behavior tree that encodes a high-level task hierarchy from one clean expert demonstration. We evaluate our method in both simulation and real-world showing successful executions of both short as well as long-range tasks and how robustness helps confront external perturbations. Videos at https://robot-skills.github.io

Submitted 22 March, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted at ICRA 2023. Videos at https://robot-skills.github.io

arXiv:2302.09328 [pdf, other]

SSVMR: Saliency-based Self-training for Video-Music Retrieval

Authors: Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou

Abstract: With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and music in the feature space. However, they (1) neglect the inevitable label noise; (2) neglect to enhance the ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, which is termed SSVMR. Specifically, we first explore how to make full use of the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of the label noise problem, where a self-training approach is adopted. In addition, we propose to capture the saliency of the video by mixing two videos at span level and preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that our SSVMR achieves state-of-the-art performance by a large margin, obtaining a relative improvement of 34.8% over the previous best model in terms of R@1.

Submitted 18 February, 2023; originally announced February 2023.

Comments: Accepted by ICASSP 2023

arXiv:2302.09316 [pdf, other]

Multi-timescale Trading Strategy for Renewable Power to Ammonia Virtual Power Plant in the Electricity, Hydrogen, and Ammonia Markets

Authors: Sirui Wu, Jin Lin, Jiarong Li, Feng Liu, Yonghua Song, Yanhui Xu, Xiang Cheng, Zhipeng Yu

Abstract: Renewable power to ammonia (RePtA) is a prominent zero-carbon pathway for decarbonization. Due to the imbalance between renewables and production energy demand, the RePtA system relies on the electricity exchange with the power grid. Participating in the electricity market as a virtual power plant (VPP) may help to reduce energy costs. However, the power profile of local photovoltaics and wind turbines is similar to those in the market, resulting in rising energy costs under the conventional strategy. Hence, we develop a multi-timescale trading strategy for the RePtA VPP in the electricity, hydrogen, and ammonia markets. By utilizing the hydrogen and ammonia buffer systems, the RePtA VPP can optimally coordinate production planning. Moreover, we find it possible to describe the trading of electricity, ammonia, and hydrogen in a unified framework. The two-stage robust optimization model of the electricity market is extended to multiple markets and solved by the column and constraint generation (CC&G) algorithm. The case is derived from an actual project in the Inner Mongolia Autonomous Region. Sensitivity analysis demonstrates the economic advantages of an RePtA VPP joining multiple markets over conventional strategy and reveals the necessity of the hydrogen and ammonia buffer and reactor's flexibility.

Submitted 18 February, 2023; originally announced February 2023.

arXiv:2302.08053 [pdf]

Selective Noise Suppression Methods Using Random SVPWM to Shape the Noise Spectrum of PMSMs

Authors: Jian Wen, Xiaobin Cheng, Peifeng Ji, Jun Yang, Feng Zhao

Abstract: Random pulse width modulation techniques are used in AC motors powered by two-level three-phase inverters, which cause a broadband spectrum of voltage, current, and electromagnetic force. The voltage distribution across a wide range of frequencies may increase the vibration and acoustic noise of motors. This study proposes two selective noise suppression (SNS) methods to eliminate voltage harmonics for specified frequencies. In the first method, the switching frequency is constant. The pulse position is calculated by the duty cycle of the current switching cycle. Both the pulse position and switching frequency are randomized in the second method. This involves creating a unique relationship among the switching frequency, pulse position, and duty cycle to shape the noise spectrum. Computer simulation and experimental results show that both methods effectively perform selective noise suppression at a specific frequency. The power spectrum density (PSD) using the second SNS method is more uniform near integer multiples of the switching frequency than that using random pulse width modulation techniques or the first SNS method. These methods provide a valuable reference for eliminating electromagnetic and acoustic noises at resonant frequencies in motors.

Submitted 6 June, 2024; v1 submitted 15 February, 2023; originally announced February 2023.

Comments: 8 pages, 15 figures

arXiv:2212.03657 [pdf, other]

M3ST: Mix at Three Levels for Speech Translation

Authors: Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

Abstract: How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's well known that data augmentation is an efficient method to improve performance for many tasks by enlarging the dataset. In this paper, we propose Mix at three levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a pre-trained model using external machine translation (MT) data. In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data. At the second stage of fine-tuning, we take both original speech sequences and original text sequences in parallel into the model to fine-tune the network, and use Jensen-Shannon divergence to regularize their outputs. Experiments on MuST-C speech translation benchmark and analysis show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.

Submitted 7 December, 2022; originally announced December 2022.

Comments: Submitted to ICASSP 2023

arXiv:2212.01042 [pdf, other]

doi 10.1109/SP46214.2022.9833716

AccEar: Accelerometer Acoustic Eavesdropping with Unconstrained Vocabulary

Authors: Pengfei Hu, Hui Zhuang, Panneer Selvam Santhalingam, Riccardo Spolaor, Parth Pathak, Guoming Zhang, Xiuzhen Cheng

Abstract: With the increasing popularity of voice-based applications, acoustic eavesdropping has become a serious threat to users' privacy. While on smartphones the access to microphones needs an explicit user permission, acoustic eavesdropping attacks can rely on motion sensors (such as accelerometer and gyroscope), access to which is unrestricted. However, previous instances of such attacks can only recognize a limited set of pre-trained words or phrases. In this paper, we present AccEar, an accelerometer-based acoustic eavesdropping attack that can reconstruct any audio played on the smartphone's loudspeaker with unconstrained vocabulary. We show that an attacker can employ a conditional Generative Adversarial Network (cGAN) to reconstruct high-fidelity audio from low-frequency accelerometer signals. The presented cGAN model learns to recreate high-frequency components of the user's voice from low-frequency accelerometer signals through spectrogram enhancement. We assess the feasibility and effectiveness of AccEar attack in a thorough set of experiments using audio from 16 public personalities. As shown by the results in both objective and subjective evaluations, AccEar successfully reconstructs user speeches from accelerometer signals in different scenarios including varying sampling rate, audio volume, device model, etc.

Submitted 2 December, 2022; originally announced December 2022.

Comments: 2022 IEEE Symposium on Security and Privacy (SP)

Journal ref: 2022 IEEE Symposium on Security and Privacy (SP)

arXiv:2210.10044 [pdf, other]

Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion

Authors: Zipeng Fu, Xuxin Cheng, Deepak Pathak

Abstract: An attached arm can significantly increase the applicability of legged robots to several mobile manipulation tasks that are not possible for the wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators is to decouple the controller into that of manipulation and locomotion. However, this is ineffective. It requires immense engineering to support coordination between the arm and legs, and error can propagate across modules causing non-smooth unnatural motions. It is also biologically implausible given evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing exploiting the causal dependency in the action space to overcome local minima during training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io

Submitted 18 October, 2022; originally announced October 2022.

Comments: CoRL 2022 (Oral). Project website at https://maniploco.github.io

arXiv:2210.07721 [pdf, other]

Mechanical features based object recognition

Authors: Pakorn Uttayopas, Xiaoxiao Cheng, Jonathan Eden, Etienne Burdet

Abstract: Current robotic haptic object recognition relies on statistical measures derived from movement dependent interaction signals such as force, vibration or position. Mechanical properties that can be identified from these signals are intrinsic object properties that may yield a more robust object representation. Therefore, this paper proposes an object recognition framework using multiple representative mechanical properties: the coefficient of restitution, stiffness, viscosity and friction coefficient. These mechanical properties are identified in real-time using a dual Kalman filter, then used to classify objects. The proposed framework was tested with a robot identifying 20 objects through haptic exploration. The results demonstrate the technique's effectiveness and efficiency, and that all four mechanical properties are required for best recognition yielding a rate of 98.18 $\pm$ 0.424 %. Clustering with Gaussian mixture models further shows that using these mechanical properties results in superior recognition as compared to using statistical parameters of the interaction signals.

Submitted 14 October, 2022; originally announced October 2022.

Comments: 9 pages, journal paper

arXiv:2209.09018 [pdf, other]

A Causal Intervention Scheme for Semantic Segmentation of Quasi-periodic Cardiovascular Signals

Authors: Xingyao Wang, Yuwen Li, Hongxiang Gao, Xianghong Cheng, Jianqing Li, Chengyu Liu

Abstract: Precise segmentation is a vital first step to analyze semantic information of cardiac cycle and capture anomaly with cardiovascular signals. However, in the field of deep semantic segmentation, inference is often unilaterally confounded by the individual attribute of data. Towards cardiovascular signals, quasi-periodicity is the essential characteristic to be learned, regarded as the synthesis of the attributes of morphology (Am) and rhythm (Ar). Our key insight is to suppress the over-dependence on Am or Ar during the generation process of deep representations. To address this issue, we establish a structural causal model as the foundation to customize the intervention approaches on Am and Ar, respectively. In this paper, we propose contrastive causal intervention (CCI) to form a novel training paradigm under a frame-level contrastive framework. The intervention can eliminate the implicit statistical bias brought by the single attribute and lead to more objective representations. We conduct comprehensive experiments with the controlled condition for QRS location and heart sound segmentation. The final results indicate that our approach can evidently improve the performance by up to 0.41% for QRS location and 2.73% for heart sound segmentation. The efficiency of the proposed method is generalized to multiple databases and noisy signals.

Submitted 19 September, 2022; originally announced September 2022.

Comments: submitted to IEEE Journal of Biomedical and Health Informatics (J-BHI)

arXiv:2209.07864 [pdf, ps, other]

doi 10.1109/MCOM.001.2200386

Toward 6G with Terahertz Communications: Understanding the Propagation Channels

Authors: Xuesong Cai, Xiang Cheng, Fredrik Tufvesson

Abstract: This article aims at providing insights for a comprehensive understanding of terahertz (THz) propagation channels. Specifically, we discuss essential THz channel characteristics to be well understood for the success of THz communications. The methodology of establishing realistic and 6G-compliant THz channel models based on measurements is then elaborated on, followed by a discussion on existing THz channel measurements in the literature. Finally, future research directions, challenges and measures to enrich the understanding of THz channels are discussed.

Submitted 26 February, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: The final version can be found in IEEE Communications Magazine

arXiv:2208.01227 [pdf, ps, other]

Optimal Measurement of Drone Swarm in RSS-based Passive Localization with Region Constraints

Authors: Xin Cheng, Feng Shu, Yifan Li, Zhihong Zhuang, Di Wu, Jiangzhou Wang

Abstract: Passive geolocation by multiple unmanned aerial vehicles (UAVs) covers a wide range of military and civilian applications, including rescue, wildlife tracking and electronic warfare. The sensor-target geometry is known to significantly affect the localization precision. Existing sensor placement strategies mainly address cases without any constraints on the sensor locations. However, UAVs cannot simply fly or hover in arbitrary regions due to realistic constraints, such as geographical limitations, security issues, and the maximum flying speed. In this paper, optimal geometrical configurations of UAVs in received signal strength (RSS)-based localization under region constraints are investigated. Employing the D-optimal criterion, i.e., maximizing the determinant of the Fisher information matrix (FIM), the optimization problem is formulated. Based on rigorous algebraic and geometrical derivations, optimal closed-form configurations of UAVs under different flying states are proposed. Finally, the effectiveness and practicality of the proposed configurations are demonstrated by simulation examples.

Submitted 7 August, 2022; v1 submitted 1 August, 2022; originally announced August 2022.

Showing 1–50 of 88 results for author: Cheng, X