Search | arXiv e-print repository

MPAI-EEV: Standardization Efforts of Artificial Intelligence based End-to-End Video Coding

Authors: Chuanmin Jia, Feng Ye, Fanke Dong, Kai Lin, Leonardo Chiariglione, Siwei Ma, Huifang Sun, Wen Gao

Abstract: The rapid advancement of artificial intelligence (AI) technology has led to the prioritization of standardizing the processing, coding, and transmission of video using neural networks. To address this priority area, the Moving Picture, Audio, and Data Coding by Artificial Intelligence (MPAI) group is developing a suite of standards called MPAI-EEV for "end-to-end optimized neural video coding." The aim of this AI-based video standard project is to compress the number of bits required to represent high-fidelity video data by utilizing data-trained neural coding technologies. This approach is not constrained by how data coding has traditionally been applied in the context of a hybrid framework. This paper presents an overview of recent and ongoing standardization efforts in this area and highlights the key technologies and design philosophy of EEV. It also provides a comparison and report on some primary efforts such as the coding efficiency of the reference model. Additionally, it discusses emerging activities such as learned Unmanned-Aerial-Vehicles (UAVs) video coding which are currently planned, under development, or in the exploration phase. With a focus on UAV video signals, this paper addresses the current status of these preliminary efforts. It also indicates development timelines, summarizes the main technical details, and provides pointers to further points of reference. The exploration experiment shows that the EEV model performs better than the state-of-the-art video coding standard H.266/VVC in terms of perceptual evaluation metric.

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.12508 [pdf, other]

FFEINR: Flow Feature-Enhanced Implicit Neural Representation for Spatio-temporal Super-Resolution

Authors: Chenyue Jiao, Chongke Bi, Lu Yang

Abstract: Large-scale numerical simulations are capable of generating data up to terabytes or even petabytes. As a promising method of data reduction, super-resolution (SR) has been widely studied in the scientific visualization community. However, most of them are based on deep convolutional neural networks (CNNs) or generative adversarial networks (GANs) and the scale factor needs to be determined before constructing the network. As a result, a single training session only supports a fixed factor and has poor generalization ability. To address these problems, this paper proposes a Feature-Enhanced Implicit Neural Representation (FFEINR) for spatio-temporal super-resolution of flow field data. It can take full advantage of the implicit neural representation in terms of model structure and sampling resolution. The neural representation is based on a fully connected network with periodic activation functions, which enables us to obtain lightweight models. The learned continuous representation can decode the low-resolution flow field input data to arbitrary spatial and temporal resolutions, allowing for flexible upsampling. The training process of FFEINR is facilitated by introducing feature enhancements for the input layer, which complements the contextual information of the flow field. To demonstrate the effectiveness of the proposed method, a series of experiments are conducted on different datasets by setting different hyperparameters. The results show that FFEINR achieves significantly better results than the trilinear interpolation method.

Submitted 26 August, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: This paper has been accepted and published by ChinaVis 2023(2023.7.21-24)

arXiv:2306.14108 [pdf, other]

SpikeCodec: An End-to-end Learned Compression Framework for Spiking Camera

Authors: Kexiang Feng, Chuanmin Jia, Siwei Ma, Wen Gao

Abstract: Recently, the bio-inspired spike camera with continuous motion recording capability has attracted tremendous attention due to its ultra high temporal resolution imaging characteristic. Such imaging feature results in huge data storage and transmission burden compared to that of traditional camera, raising severe challenge and imminent necessity in compression for spike camera captured content. Existing lossy data compression methods could not be applied for compressing spike streams efficiently due to integrate-and-fire characteristic and binarized data structure. Considering the imaging principle and information fidelity of spike cameras, we introduce an effective and robust representation of spike streams. Based on this representation, we propose a novel learned spike compression framework using scene recovery, variational auto-encoder plus spike simulator. To our knowledge, it is the first data-trained model for efficient and robust spike stream compression. Extensive experimental results show that our method outperforms the conventional and learning-based codecs, contributing a strong baseline for learned spike data compression.

Submitted 24 June, 2023; originally announced June 2023.

Comments: 13 pages, 11 figures and 5 tables

arXiv:2304.09322 [pdf, other]

doi 10.1016/j.eswa.2023.119965

Multi-Modality Multi-Scale Cardiovascular Disease Subtypes Classification Using Raman Image and Medical History

Authors: Bo Yu, Hechang Chen, Chengyou Jia, Hongren Zhou, Lele Cong, Xiankai Li, Jianhui Zhuang, Xianling Cong

Abstract: Raman spectroscopy (RS) has been widely used for disease diagnosis, e.g., cardiovascular disease (CVD), owing to its efficiency and component-specific testing capabilities. A series of popular deep learning methods have recently been introduced to learn nuance features from RS for binary classifications and achieved outstanding performance than conventional machine learning methods. However, these existing deep learning methods still confront some challenges in classifying subtypes of CVD. For example, the nuance between subtypes is quite hard to capture and represent by intelligent models due to the chillingly similar shape of RS sequences. Moreover, medical history information is an essential resource for distinguishing subtypes, but they are underutilized. In light of this, we propose a multi-modality multi-scale model called M3S, which is a novel deep learning method with two core modules to address these issues. First, we convert RS data to various resolution images by the Gramian angular field (GAF) to enlarge nuance, and a two-branch structure is leveraged to get embeddings for distinction in the multi-scale feature extraction module. Second, a probability matrix and a weight matrix are used to enhance the classification capacity by combining the RS and medical history data in the multi-modality data fusion module. We perform extensive evaluations of M3S and found its outstanding performance on our in-house dataset, with accuracy, precision, recall, specificity, and F1 score of 0.9330, 0.9379, 0.9291, 0.9752, and 0.9334, respectively. These results demonstrate that the M3S has high performance and robustness compared with popular methods in diagnosing CVD subtypes.

Submitted 18 April, 2023; originally announced April 2023.

Journal ref: Expert Systems with Applications, 2023: 119965

arXiv:2304.06896 [pdf, other]

doi 10.1109/TCSVT.2024.3418493

Machine Perception-Driven Image Compression: A Layered Generative Approach

Authors: Yuefeng Zhang, Chuanmin Jia, Jiannhui Chang, Siwei Ma

Abstract: In this age of information, images are a critical medium for storing and transmitting information. With the rapid growth of image data amount, visual compression and visual data perception are two important research topics attracting a lot attention. However, those two topics are rarely discussed together and follow separate research path. Due to the compact compressed domain representation offered by learning-based image compression methods, there exists possibility to have one stream targeting both efficient data storage and compression, and machine perception tasks. In this paper, we propose a layered generative image compression model achieving high human vision-oriented image reconstructed quality, even at extreme compression ratios. To obtain analysis efficiency and flexibility, a task-agnostic learning-based compression model is proposed, which effectively supports various compressed domain-based analytical tasks while reserves outstanding reconstructed perceptual quality, compared with traditional and learning-based codecs. In addition, joint optimization schedule is adopted to acquire best balance point among compression ratio, reconstructed image quality, and downstream perception performance. Experimental results verify that our proposed compressed domain-based multi-task analysis method can achieve comparable analysis results against the RGB image-based methods with up to 99.6% bit rate saving (i.e., compared with taking original RGB image as the analysis model input). The practical ability of our model is further justified from model size and information fidelity aspects.

Submitted 13 April, 2023; originally announced April 2023.

Comments: 12 pages, 12 figures

Journal ref: IEEE Transactions on Circuits and Systems for Video Technology 2024

arXiv:2301.06115 [pdf, other]

Learning to Compress Unmanned Aerial Vehicle (UAV) Captured Video: Benchmark and Analysis

Authors: Chuanmin Jia, Feng Ye, Huifang Sun, Siwei Ma, Wen Gao

Abstract: During the past decade, the Unmanned-Aerial-Vehicles (UAVs) have attracted increasing attention due to their flexible, extensive, and dynamic space-sensing capabilities. The volume of video captured by UAVs is exponentially growing along with the increased bitrate generated by the advancement of the sensors mounted on UAVs, bringing new challenges for on-device UAV storage and air-ground data transmission. Most existing video compression schemes were designed for natural scenes without consideration of specific texture and view characteristics of UAV videos. In this work, we first contribute a detailed analysis of the current state of the field of UAV video coding. Then we propose to establish a novel task for learned UAV video coding and construct a comprehensive and systematic benchmark for such a task, present a thorough review of high quality UAV video datasets and benchmarks, and contribute extensive rate-distortion efficiency comparison of learned and conventional codecs after. Finally, we discuss the challenges of encoding UAV videos. It is expected that the benchmark will accelerate the research and development in video coding on drone platforms.

Submitted 15 January, 2023; originally announced January 2023.

Comments: MPAI End-to-end Video group progress report, DCC 2023

arXiv:2209.02574 [pdf, other]

doi 10.1145/3474085.3475558

Cross Modal Compression: Towards Human-comprehensible Semantic Compression

Authors: Jiguo Li, Chuanmin Jia, Xinfeng Zhang, Siwei Ma, Wen Gao

Abstract: Traditional image/video compression aims to reduce the transmission/storage cost with signal fidelity as high as possible. However, with the increasing demand for machine analysis and semantic monitoring in recent years, semantic fidelity rather than signal fidelity is becoming another emerging concern in image/video compression. With the recent advances in cross modal translation and generation, in this paper, we propose the cross modal compression (CMC), a semantic compression framework for visual data, to transform the high redundant visual data (such as image, video, etc.) into a compact, human-comprehensible domain (such as text, sketch, semantic map, attributions, etc.), while preserving the semantic. Specifically, we first formulate the CMC problem as a rate-distortion optimization problem. Secondly, we investigate the relationship with the traditional image/video compression and the recent feature compression frameworks, showing the difference between our CMC and these prior frameworks. Then we propose a novel paradigm for CMC to demonstrate its effectiveness. The qualitative and quantitative results show that our proposed CMC can achieve encouraging reconstructed results with an ultrahigh compression ratio, showing better compression performance than the widely used JPEG baseline.

Submitted 6 September, 2022; originally announced September 2022.

Comments: 10 pages, 4 figures

arXiv:2106.14371 [pdf, other]

Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits

Authors: Qingjian Lin, Lin Yang, Xuyang Wang, Luyuan Xie, Chen Jia, Junjie Wang

Abstract: Target speech separation is the process of filtering a certain speaker's voice out of speech mixtures according to the additional speaker identity information provided. Recent works have made considerable improvement by processing signals in the time domain directly. The majority of them take fully overlapped speech mixtures for training. However, since most real-life conversations occur randomly and are sparsely overlapped, we argue that training with different overlap ratio data benefits. To do so, an unavoidable problem is that the popularly used SI-SNR loss has no definition for silent sources. This paper proposes the weighted SI-SNR loss, together with the joint learning of target speech separation and personal VAD. The weighted SI-SNR loss imposes a weight factor that is proportional to the target speaker's duration and returns zero when the target speaker is absent. Meanwhile, the personal VAD generates masks and sets non-target speech to silence. Experiments show that our proposed method outperforms the baseline by 1.73 dB in terms of SDR on fully overlapped speech, as well as by 4.17 dB and 0.9 dB on sparsely overlapped speech of clean and noisy conditions. Besides, with slight degradation in performance, our model could reduce the time costs in inference.

Submitted 26 September, 2021; v1 submitted 27 June, 2021; originally announced June 2021.

Comments: Accepted by APSIPA 2021

arXiv:2106.12954 [pdf, other]

Rate Distortion Characteristic Modeling for Neural Image Compression

Authors: Chuanmin Jia, Ziqing Ge, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: End-to-end optimized neural image compression (NIC) has obtained superior lossy compression performance recently. In this paper, we consider the problem of rate-distortion (R-D) characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep networks. Thus arbitrary bit-rate points could be elegantly realized by leveraging such model via a single trained network. We propose a plug-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of auto-encoder. The proposed scheme resolves the problem of training distinct models to reach different points in the R-D space. Furthermore, we model the rate and distortion characteristic of NIC as a function of the coding parameter $λ$ respectively. Our experiments show our proposed method is easy to adopt and realizes state-of-the-art continuous bit-rate coding performance, which implies that our approach would benefit the practical deployment of NIC.

Submitted 13 January, 2022; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: 10 pages, accepted by DCC 2022 as full paper

arXiv:2104.10315 [pdf, ps, other]

Visual Analysis Motivated Rate-Distortion Model for Image Coding

Authors: Zhimeng Huang, Chuanmin Jia, Shanshe Wang, Siwei Ma

Abstract: Optimized for pixel fidelity metrics, images compressed by existing image codec are facing systematic challenges when used for visual analysis tasks, especially under low-bitrate coding. This paper proposes a visual analysis-motivated rate-distortion model for Versatile Video Coding (VVC) intra compression. The proposed model has two major contributions, a novel rate allocation strategy and a new distortion measurement model. We first propose the region of interest for machine (ROIM) to evaluate the degree of importance for each coding tree unit (CTU) in visual analysis. Then, a novel CTU-level bit allocation model is proposed based on ROIM and the local texture characteristics of each CTU. After an in-depth analysis of multiple distortion models, a visual analysis friendly distortion criteria is subsequently proposed by extracting deep feature of each coding unit (CU). To alleviate the problem of lacking spatial context information when calculating the distortion of each CU, we finally propose a multi-scale feature distortion (MSFD) metric using different neighboring pixels by weighting the extracted deep features in each scale. Extensive experimental results show that the proposed scheme could achieve up to 28.17% bitrate saving under the same analysis performance among several typical visual analysis tasks such as image classification, object detection, and semantic segmentation.

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2103.07131 [pdf, other]

Thousand to One: Semantic Prior Modeling for Conceptual Coding

Authors: Jianhui Chang, Zhenghui Zhao, Lingbo Yang, Chuanmin Jia, Jian Zhang, Siwei Ma

Abstract: Conceptual coding has been an emerging research topic recently, which encodes natural images into disentangled conceptual representations for compression. However, the compression performance of the existing methods is still sub-optimal due to the lack of comprehensive consideration of rate constraint and reconstruction quality. To this end, we propose a novel end-to-end semantic prior modeling-based conceptual coding scheme towards extremely low bitrate image compression, which leverages semantic-wise deep representations as a unified prior for entropy estimation and texture synthesis. Specifically, we employ semantic segmentation maps as structural guidance for extracting deep semantic prior, which provides fine-grained texture distribution modeling for better detail construction and higher flexibility in subsequent high-level vision tasks. Moreover, a cross-channel entropy model is proposed to further exploit the inter-channel correlation of the spatially independent semantic prior, leading to more accurate entropy estimation for rate-constrained training. The proposed scheme achieves an ultra-high 1000x compression ratio, while still enjoying high visual reconstruction quality and versatility towards visual processing and analysis tasks.

Submitted 15 March, 2021; v1 submitted 12 March, 2021; originally announced March 2021.

Comments: ICME 2021 ORAL accepted

arXiv:2011.04976 [pdf, other]

Conceptual Compression via Deep Structure and Texture Synthesis

Authors: Jianhui Chang, Zhenghui Zhao, Chuanmin Jia, Shiqi Wang, Lingbo Yang, Qi Mao, Jian Zhang, Siwei Ma

Abstract: Existing compression methods typically focus on the removal of signal-level redundancies, while the potential and versatility of decomposing visual data into compact conceptual components still lack further study. To this end, we propose a novel conceptual compression framework that encodes visual data into compact structure and texture representations, then decodes in a deep synthesis fashion, aiming to achieve better visual reconstruction quality, flexible content manipulation, and potential support for various vision tasks. In particular, we propose to compress images by a dual-layered model consisting of two complementary visual features: 1) structure layer represented by structural maps and 2) texture layer characterized by low-dimensional deep representations. At the encoder side, the structural maps and texture representations are individually extracted and compressed, generating the compact, interpretable, inter-operable bitstreams. During the decoding stage, a hierarchical fusion GAN (HF-GAN) is proposed to learn the synthesis paradigm where the textures are rendered into the decoded structural maps, leading to high-quality reconstruction with remarkable visual realism. Extensive experiments on diverse images have demonstrated the superiority of our framework with lower bitrates, higher reconstruction quality, and increased versatility towards visual analysis and content manipulation tasks.

Submitted 10 March, 2022; v1 submitted 10 November, 2020; originally announced November 2020.

Comments: 15 pages, 14 figures

arXiv:2006.12696 [pdf, ps, other]

When Distributed Formation Control Is Feasible under Hard Constraints on Energy and Time?

Authors: Chunxiang Jia, Fei Chen, Linying Xiang, Weiyao Lan, Gang Feng

Abstract: This paper studies distributed optimal formation control with hard constraints on energy levels and termination time, in which the formation error is to be minimized jointly with the energy cost. The main contributions include a globally optimal distributed formation control law and a comprehensive analysis of the resulting closed-loop system under those hard constraints. It is revealed that the energy levels, the task termination time, the steady-state error tolerance, as well as the network topology impose inherent limitations in achieving the formation control mission. Most notably, the lower bounds on the achievable termination time and the required minimum energy levels are derived, which are given in terms of the initial formation error, the steady-state error tolerance, and the largest eigenvalue of the Laplacian matrix. These lower bounds can be employed to assert whether an energy and time constrained formation task is achievable and how to accomplish such a task. Furthermore, the monotonicity of those lower bounds in relation to the control parameters is revealed. A simulation example is finally given to illustrate the obtained results.

Submitted 22 June, 2020; originally announced June 2020.

arXiv:2006.11730 [pdf, other]

High-Resolution Channel Estimation for Intelligent Reflecting Surface-Assisted MmWave Communications

Authors: C. Jia, J. Cheng, H. Gao, W. Xu

Abstract: In this paper, we study the high-resolution channel estimation problem for intelligent reflecting surface (IRS)-assisted millimeter wave (mmWave) multiple-input-multiple-output (MIMO) communications, which is a prerequisite to guarantee further high-rate data transmission. Considering the typical sparsity of mmWave channels, we formulate the cascaded channel estimation problem from a sparse signal recovery perspective, and then propose a novel two-step cascaded channel estimation protocol to estimate the cascaded user-IRS-base station channel with high-resolution for IRS-assisted mmWave MIMO communications. More specifically, the first step is to estimate the coarse angular domain information (ADI) and further establish the robust uplink by beam training. In the second step, by exploiting the coarse ADI, an adaptive grid matching pursuit (AGMP) algorithm is proposed to estimate the high-resolution cascaded channel state information (CSI) with low complexity. Simulation results verify that the proposed two-step channel estimation protocol significantly outperforms the state-of-the-art scheme, i.e., beam training based channel estimation, and meanwhile can reap near-optimal system performance achieved by perfect CSI.

Submitted 21 June, 2020; originally announced June 2020.

Comments: 6 pages, 7 figures, conference

arXiv:2004.03428 [pdf, other]

Universal Adversarial Perturbations Generative Network for Speaker Recognition

Authors: Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao

Abstract: Attacking deep learning based biometric systems has drawn more and more attention with the wide deployment of fingerprint/face/speaker recognition systems, given the fact that the neural networks are vulnerable to the adversarial examples, which have been intentionally perturbed to remain almost imperceptible for human. In this paper, we demonstrated the existence of the universal adversarial perturbations (UAPs) for the speaker recognition systems. We proposed a generative network to learn the mapping from the low-dimensional normal distribution to the UAPs subspace, then synthesize the UAPs to perturb any input signals to spoof the well-trained speaker recognition model with high probability. Experimental results on TIMIT and LibriSpeech datasets demonstrate the effectiveness of our model.

Submitted 7 April, 2020; originally announced April 2020.

Comments: Accepted by ICME2020

arXiv:2004.03413 [pdf, other]

doi 10.1109/JSTSP.2020.2987417

Direct Speech-to-image Translation

Authors: Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao

Abstract: Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no writing form. However, as far as we know, it has not been well-studied how to translate the speech signals into images directly and how well they can be translated. In this paper, we attempt to translate the speech signals into the image signals without the transcription stage. Specifically, a speech encoder is designed to represent the input speech signals as an embedding feature, and it is trained with a pretrained image encoder using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that our proposed method is effective to translate the raw speech signals into images without the middle text representation. Ablation study gives more insights about our method.

Submitted 9 April, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: Accepted by JSTSP

arXiv:2003.01400 [pdf, ps, other]

OTFS Based Receiver Scheme With Multi-Antennas in High-Mobility V2X Systems

Authors: Junqiang Cheng, Chenglu Jia, Hui Gao, Wenjun Xu, Zhisong Bie

Abstract: Vehicle-to-everything (V2X) is considered one of the most important applications of future wireless communication networks. However, the Doppler effect caused by vehicle mobility may seriously deteriorate the performance of vehicular communication links, especially when the channels exhibit a large number of Doppler frequency offsets (DFOs). Orthogonal time frequency space (OTFS) is a new waveform designed in the delay-Doppler domain that can effectively convert a doubly dispersive channel into an almost non-fading channel, which makes it very attractive for V2X communications. In this paper, we design a novel OTFS based receiver with multi-antennas to deal with the high-mobility challenges in V2X systems. We show that the multiple DFOs associated with multipaths can be separated with the high spatial resolution provided by multi-antennas, which leads to an enhanced sparsity of the OTFS channel in the delay-Doppler domain and bears a potential to reduce the complexity of the message passing (MP) detection algorithm. Based on this observation, we further propose a joint MP-maximum ratio combining (MRC) iterative detection for OTFS, where the integration of MRC significantly improves the convergence performance of the iteration and achieves excellent system error performance. Finally, we provide numerical simulation results to corroborate the superiority of the proposed scheme.

Submitted 3 March, 2020; originally announced March 2020.

Comments: Accepted in IEEE ICC'20 Workshop - V2X-NGD

arXiv:2003.01306 [pdf, other]

Machine Learning Empowered Beam Management for Intelligent Reflecting Surface Assisted MmWave Networks

Authors: Chenglu Jia, Hui Gao, Na Chen, Yuan He

Abstract: Recently, intelligent reflecting surface (IRS) assisted mmWave networks are emerging, which bear the potential to address the blockage issue of millimeter wave (mmWave) communication in a more cost-effective way. In particular, an IRS is built from passive and programmable electromagnetic elements that can manipulate the mmWave propagation channel into a more favorable condition that is free of blockage via judicious joint BS-IRS transmission design. However, the coexistence of IRSs and mmWave BSs complicates the network architecture, and thus poses great challenges for efficient beam management (BM), which is one critical prerequisite for high-performance mmWave networks. In this paper, we systematically evaluate the key issues and challenges of BM for IRS-assisted mmWave networks to bring insights into future network design. Specifically, we carefully classify and discuss the extensibility and limitations of the existing BM of conventional mmWave towards the IRS-assisted new paradigm. Moreover, we propose a novel machine learning empowered BM framework for IRS-assisted networks with representative showcases, which processes environmental and mobility awareness to achieve highly efficient BM with significantly reduced system overhead. Finally, some interesting future directions are also suggested to inspire further research.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1909.13342 [pdf, other]

Interference-Precancelled Pilot Design for LMMSE Channel Estimation of GFDM

Authors: Ching-Lun Tai, Borching Su, Cai Jia

Abstract: Generalized frequency division multiplexing (GFDM) is a promising candidate waveform for next-generation wireless communication systems. However, GFDM channel estimation is still challenging due to the inherent interference. In this paper, we formulate a pilot design framework with linear minimum mean square error (LMMSE) channel estimation for GFDM, and propose a novel pilot design to achieve interference precancellation during pilot generation with the fixed transmit sample values at selected frequency bins. Numerical results demonstrate that the proposed method reduces the channel estimation mean square error and the symbol error rate (SER) in high signal-to-noise ratio (SNR) regions, compared with the conventional methods.

Submitted 29 September, 2019; originally announced September 2019.

Comments: 5 pages, 6 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:1906.07233 [pdf, ps, other]

doi 10.1109/JIOT.2019.2919225

Learn to Sense: a Meta-learning Based Sensing and Fusion Framework for Wireless Sensor Networks

Authors: Hui Wu, Zhaoyang Zhang, Chunxu Jiao, Chunguang Li, Tony Q. S. Quek

Abstract: Wireless sensor networks (WSNs) act as the backbone of Internet of Things (IoT) technology. In WSNs, field sensing and fusion are the most commonly seen problems, which involve collecting and processing a huge volume of spatial samples in an unknown field to reconstruct the field or extract its features. One of the major concerns is how to reduce the communication overhead and data redundancy with prescribed fusion accuracy. In this paper, an integrated communication and computation framework based on meta-learning is proposed to enable adaptive field sensing and reconstruction. It consists of a stochastic-gradient-descent (SGD) based base-learner used for the field model prediction, aiming to minimize the average prediction error, and a reinforcement meta-learner aiming to optimize the sensing decision by simultaneously rewarding the error reduction with samples obtained so far and penalizing the corresponding communication cost. An adaptive sensing algorithm based on the above two-layer meta-learning framework is presented. It actively determines the next most informative sensing location, and thus considerably reduces the spatial samples and yields superior performance and robustness compared with conventional schemes. The convergence behavior of the proposed algorithm is also comprehensively analyzed and simulated. The results reveal that the proposed field sensing algorithm significantly improves the convergence rate.

Submitted 13 June, 2019; originally announced June 2019.

Comments: Paper accepted for publication in IEEE Internet of Things Journal

arXiv:1903.09752 [pdf, ps, other]

MmWave Communication With Active Ambient Perception

Authors: Chunxu Jiao, Zhaoyang Zhang, Caijun Zhong, Xiaoming Chen, Zhiyong Feng

Abstract: In existing communication systems, the channel state information of each UE (user equipment) should be repeatedly estimated when it moves to a new position or another UE takes its place. The underlying ambient information, including the specific layout of potential reflectors, which provides more detailed information about all UEs' channel structures, has not been fully explored and exploited. In this paper, we rethink the mmWave channel estimation problem in a new and indirect way, i.e., instead of estimating the resultant composite channel response at each time and for any specific location, we first conduct the ambient perception exploiting the fascinating radar capability of a mmWave antenna array and then accomplish the location-based sparse channel reconstruction. In this way, the sparse channel for a quasi-static UE arriving at a specific location can be rapidly synthesized based on the perceived ambient information, thus greatly reducing the signalling overhead and online computational complexity. Based on the reconstructed mmWave channel, single-beam mmWave communication is designed and evaluated which shows excellent performance. Such an approach in fact integrates radar with communication, which may possibly open a new paradigm for future communication system design.

Submitted 22 March, 2019; originally announced March 2019.

Comments: Accepted for publication in IEEE Transactions on Wireless Communications

arXiv:1903.07243 [pdf]

doi 10.1109/JSTARS.2018.2879440

Complex Scene Classification of PolSAR Imagery based on a Self-paced Learning Approach

Authors: Wenshuai Chen, Shuiping Gou, Xinlin Wang, Licheng Jiao, Changzhe Jiao, Alina Zare

Abstract: Existing polarimetric synthetic aperture radar (PolSAR) image classification methods cannot achieve satisfactory performance on complex scenes characterized by several types of land cover with significant levels of noise or similar scattering properties across land cover types. Hence, we propose a supervised classification method aimed at constructing a classifier based on self-paced learning (SPL). SPL has been demonstrated to be effective at dealing with complex data while providing classifier. In this paper, a novel Support Vector Machine (SVM) algorithm based on SPL with neighborhood constraints (SVM_SPLNC) is proposed. The proposed method leverages the easiest samples first to obtain an initial parameter vector. Then, more complex samples are gradually incorporated to update the parameter vector iteratively. Moreover, neighborhood constraints are introduced during the training process to further improve performance. Experimental results on three real PolSAR images show that the proposed method performs well on complex scenes.

Submitted 17 March, 2019; originally announced March 2019.

arXiv:1711.00727 [pdf, ps, other]

Performance Evaluation of Channel Decoding With Deep Neural Networks

Authors: Wei Lyu, Zhaoyang Zhang, Chunxu Jiao, Kangjian Qin, Huazi Zhang

Abstract: With the demand for high data rates and low latency in fifth generation (5G) systems, the deep neural network decoder (NND) has become a promising candidate due to its capability of one-shot decoding and parallel computing. In this paper, three types of NND, i.e., multi-layer perceptron (MLP), convolutional neural network (CNN) and recurrent neural network (RNN), are proposed with the same parameter magnitude. The performance of these deep neural networks is evaluated through extensive simulation. Numerical results show that RNN has the best decoding performance, yet at the price of the highest computational overhead. Moreover, we find there exists a saturation length for each type of neural network, which is caused by their restricted learning abilities.

Submitted 31 January, 2018; v1 submitted 1 November, 2017; originally announced November 2017.

Comments: 6 pages, 11 figures, Latex; typos corrected; IEEE ICC 2018 to appear

Showing 1–23 of 23 results for author: Jia, C