Joint Reference Frame Synthesis and Post Filter Enhancement for Versatile Video Coding

Authors: Weijie Bao, Yuantong Zhang, Jianghao Jia, Zhenzhong Chen, Shan Liu

Abstract: This paper presents the joint reference frame synthesis (RFS) and post-processing filter enhancement (PFE) for Versatile Video Coding (VVC), aiming to explore the combination of different neural network-based video coding (NNVC) tools to better utilize the hierarchical bi-directional coding structure of VVC. Both RFS and PFE utilize the Space-Time Enhancement Network (STENet), which receives two input frames with artifacts and produces two enhanced frames with suppressed artifacts, along with an intermediate synthesized frame. STENet comprises two pipelines, the synthesis pipeline and the enhancement pipeline, tailored for different purposes. During RFS, two reconstructed frames are sent into STENet's synthesis pipeline to synthesize a virtual reference frame, similar to the current to-be-coded frame. The synthesized frame serves as an additional reference frame inserted into the reference picture list (RPL). During PFE, two reconstructed frames are fed into STENet's enhancement pipeline to alleviate their artifacts and distortions, resulting in enhanced frames with reduced artifacts and distortions. To reduce inference complexity, we propose joint inference of RFS and PFE (JISE), achieved through a single execution of STENet. Integrated into the VVC reference software VTM-15.0, RFS, PFE, and JISE are coordinated within a novel Space-Time Enhancement Window (STEW) under Random Access (RA) configuration. The proposed method could achieve -7.34%/-17.21%/-16.65% PSNR-based BD-rate on average for three components under RA configuration.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.13748 [pdf, other]

Application of Kalman Filter in Stochastic Differential Equations

Authors: Wencheng Bao, Shi Feng, Kaiwen Zhang

Abstract: In areas such as finance, engineering, and science, we often face situations that change quickly and unpredictably. These situations are tough to handle and require special tools and methods capable of understanding and predicting what might happen next. Stochastic Differential Equations (SDEs) are renowned for modeling and analyzing real-world dynamical systems. However, obtaining the parameters, boundary conditions, and closed-form solutions of SDEs can often be challenging. In this paper, we will discuss the application of Kalman filtering theory to SDEs, including Extended Kalman filtering and Particle Extended Kalman filtering. We will explore how to fit existing SDE systems through filtering and track the original SDEs by fitting the obtained closed-form solutions. This approach aims to gather more information about these SDEs, which could be used in various ways, such as incorporating them into parameters of data-based SDE models.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 18 pages, 14 figures

arXiv:2305.11094 [pdf, other]

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

Authors: Sicheng Yang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Lei Hao, Weihong Bao, Haolin Zhuang

Abstract: Speech-driven gesture generation is highly challenging due to the random jitters of human motion. In addition, there is an inherent asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion-matching framework. Specifically, we first present a gesture VQ-VAE module to learn a codebook to summarize meaningful gesture units. With each code representing a unique gesture, random jittering problems are alleviated effectively. We then use Levenshtein distance to align diverse gestures with different speech. Levenshtein distance based on audio quantization as a similarity metric of corresponding speech of gestures helps match more appropriate gestures with speech, and solves the alignment problem of speech and gestures well. Moreover, we introduce phase to guide the optimal gesture matching based on the semantics of context or rhythm of audio. Phase guides when text-based or speech-based gestures should be performed to make the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches on speech-driven gesture generation. Our code, database, pre-trained models, and demos are available at https://github.com/YoungSeng/QPGesture.

Submitted 18 May, 2023; originally announced May 2023.

Comments: 15 pages, 12 figures, CVPR 2023 Highlight

arXiv:2208.12133 [pdf, other]

doi 10.1145/3536221.3558066

The ReprGesture entry to the GENEA Challenge 2022

Authors: Sicheng Yang, Zhiyong Wu, Minglei Li, Mengchen Zhao, Jiuxin Lin, Liyang Chen, Weihong Bao

Abstract: This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) challenge 2022. The GENEA challenge provides the processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text and position and rotation matrix features for gesture. Each modality is projected to two distinct subspaces: modality-invariant and modality-specific. To learn inter-modality-invariant commonalities and capture the characters of modality-specific representations, gradient reversal layer based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder generates proper gestures using all representations and features related to the rhythm in the audio. Our code, pre-trained models and demo are available at https://github.com/YoungSeng/ReprGesture.

Submitted 25 August, 2022; originally announced August 2022.

Comments: 8 pages, 4 figures, ICMI 2022

arXiv:2206.04922 [pdf, other]

A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation

Authors: Junhui Zhang, Wudi Bao, Junjie Pan, Xiang Yin, Zejun Ma

Abstract: Chinese dialects are different variations of Chinese and can be considered as different languages in the same language family with Mandarin. Though they all use Chinese characters, the pronunciations, grammar and idioms can vary significantly, and even local speakers may find it hard to input correct written forms of dialect. Besides, using Mandarin text as text-to-speech inputs would generate speech with poor naturalness. In this paper, we propose a novel Chinese dialect TTS frontend with a translation module, which converts Mandarin text into dialectic expressions to improve the intelligibility and naturalness of synthesized speech. A non-autoregressive neural machine translation model with various tricks is proposed for the translation task. It is the first known work to incorporate translation with TTS frontend. Experiments on Cantonese show the proposed model improves 2.56 BLEU and TTS improves 0.27 MOS with Mandarin inputs.

Submitted 12 December, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: 4 pages, 5 figures

arXiv:2103.09455 [pdf, other]

Prediction-assistant Frame Super-Resolution for Video Streaming

Authors: Wang Shen, Wenbo Bao, Guangtao Zhai, Charlie L Wang, Jerry W Hu, Zhiyong Gao

Abstract: Video frame transmission delay is critical in real-time applications such as online video gaming, live show, etc. The receiving deadline of a new frame must catch up with the frame rendering time. Otherwise, the system will buffer a while, and the user will encounter a frozen screen, resulting in unsatisfactory user experiences. An effective approach is to transmit frames in lower-quality under poor bandwidth conditions, such as using scalable video coding. In this paper, we propose to enhance video quality using lossy frames in two situations. First, when current frames are too late to receive before rendering deadline (i.e., lost), we propose to use previously received high-resolution images to predict the future frames. Second, when the quality of the currently received frames is low (i.e., lossy), we propose to use previously received high-resolution frames to enhance the low-quality current ones. For the first case, we propose a small yet effective video frame prediction network. For the second case, we improve the video prediction network to a video enhancement network to associate current frames as well as previous frames to restore high-quality images. Extensive experimental results demonstrate that our method performs favorably against state-of-the-art algorithms in the lossy video streaming environment.

Submitted 17 March, 2021; originally announced March 2021.

arXiv:2103.08259 [pdf, other]

The QXS-SAROPT Dataset for Deep Learning in SAR-Optical Data Fusion

Authors: Meiyu Huang, Yao Xu, Lixin Qian, Weili Shi, Yaqin Zhang, Wei Bao, Nan Wang, Xuejiao Liu, Xueshuang Xiang

Abstract: Deep learning techniques have made an increasing impact on the field of remote sensing. However, deep neural networks based fusion of multimodal data from different remote sensors with heterogeneous characteristics has not been fully explored, due to the lack of availability of big amounts of perfectly aligned multi-sensor image data with diverse scenes of high resolutions, especially for synthetic aperture radar (SAR) data and optical imagery. To promote the development of deep learning based SAR-optical fusion approaches, we release the QXS-SAROPT dataset, which contains 20,000 pairs of SAR-optical image patches. We obtain the SAR patches from SAR satellite GaoFen-3 images and the optical patches from Google Earth images. These images cover three port cities: San Diego, Shanghai and Qingdao. Here, we present a detailed introduction of the construction of the dataset, and show its two representative exemplary applications, namely SAR-optical image matching and SAR ship detection boosted by cross-modal information from optical images. As a large open SAR-optical dataset with multiple scenes of a high resolution, we believe QXS-SAROPT will be of potential value for further research in SAR-optical data fusion technology based on deep learning.

Submitted 25 April, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

arXiv:1910.04572 [pdf]

Design, Modelling and Validation of a Novel Extra Slender Continuum Robot for In-situ Inspection and Repair in Aeroengine

Authors: Mingfeng Wang, Xin Dong, Weiming Ba, Abdelkhalick Mohammad, Dragos Axinte, Andy Norton

Abstract: In-situ aeroengine maintenance is highly beneficial, as it can significantly reduce the current maintenance cycle, which is extensive and costly due to the requirement to disassemble engines from aircraft. However, navigating in/out via inspection ports and performing multi-axis movements with end-effectors in constrained environments (e.g. combustion chamber) are fairly challenging. A novel extra-slender (diameter-to-length ratio <0.02) dual-stage continuum robot (16 degree-of-freedom) is proposed to navigate in/out of confined environments and perform required configuration shapes for further repair operations. Firstly, the robot design presents several innovative mechatronic solutions: (i) dual-stage tendon-driven structure with bevelled disks to perform required shapes and to provide selective stiffness for carrying high payloads; (ii) various rigid-compliant combined joints to enable different flexibility and stiffness in each stage; (iii) three commanding cables for each 2-DoF section to minimise the number of actuators with precise actuations. Secondly, a segment-scaled piecewise-constant-curvature-theory based kinematic model and a Kirchhoff-elastic-rod-theory based static model are established by considering the applied forces/moments (friction, actuation, gravity and external load), where the friction coefficient is modelled as a function of bending angle. Finally, experiments were carried out to validate the proposed static modelling and to evaluate the robot capabilities of performing the predefined shape and stiffness.

Submitted 9 October, 2019; originally announced October 2019.

Comments: 11 pages, 12 figures, journal

arXiv:1909.13051 [pdf, other]

A Dual Camera System for High Spatiotemporal Resolution Video Acquisition

Authors: Ming Cheng, Zhan Ma, M. Salman Asif, Yiling Xu, Haojie Liu, Wenbo Bao, Jun Sun

Abstract: This paper presents a dual camera system for high spatiotemporal resolution (HSTR) video acquisition, where one camera shoots a video with high spatial resolution and low frame rate (HSR-LFR) and another one captures a low spatial resolution and high frame rate (LSR-HFR) video. Our main goal is to combine videos from LSR-HFR and HSR-LFR cameras to create an HSTR video. We propose an end-to-end learning framework, AWnet, mainly consisting of a FlowNet and a FusionNet that learn an adaptive weighting function in pixel domain to combine inputs in a frame recurrent fashion. To improve the reconstruction quality for cameras used in reality, we also introduce noise regularization under the same framework. Our method has demonstrated noticeable performance gains in terms of both objective PSNR measurement in simulation with different publicly available video and light-field datasets and subjective evaluation with real data captured by dual iPhone 7 and Grasshopper3 cameras. Ablation studies are further conducted to investigate and explore various aspects (such as reference structure, camera parallax, exposure time, etc.) of our system to fully understand its capability for potential applications.

Submitted 24 March, 2020; v1 submitted 28 September, 2019; originally announced September 2019.

Comments: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence

arXiv:1909.00548 [pdf, other]

Resource Optimized Neural Architecture Search for 3D Medical Image Segmentation

Authors: Woong Bae, Seungho Lee, Yeha Lee, Beomhee Park, Minki Chung, Kyu-Hwan Jung

Abstract: Neural Architecture Search (NAS), a framework which automates the task of designing neural networks, has recently been actively studied in the field of deep learning. However, there are only a few NAS methods suitable for 3D medical image segmentation. Medical 3D images are generally very large; thus it is difficult to apply previous NAS methods due to their GPU computational burden and long training time. We propose the resource-optimized neural architecture search method which can be applied to 3D medical segmentation tasks in a short training time (1.39 days for 1GB dataset) using a small amount of computation power (one RTX 2080Ti, 10.8GB GPU memory). Excellent performance can also be achieved without retraining (fine-tuning) which is essential in most NAS methods. These advantages can be achieved by using a reinforcement learning-based controller with parameter sharing and focusing on the optimal search space configuration of macro search rather than micro search. Our experiments demonstrate that the proposed NAS method outperforms manually designed networks with state-of-the-art performance in 3D medical image segmentation.

Submitted 2 September, 2019; originally announced September 2019.

Comments: Accepted at MICCAI (International Conference on Medical Image Computing and Computer Assisted Intervention) 2019

arXiv:1905.06103 [pdf]

Closed Loop Load Model Identification Using Small Disturbance Data

Authors: Shangyuan Li, Li Feng, Deqiang Gan, Zhen Wang, Wei Bao, Hao Xu

Abstract: Load model identification using small disturbance data is studied. It is proved that the individual load to be identified and the rest of the system form a closed-loop system. Then, the impacts of disturbances entering the feedforward channel (internal disturbance) and feedback channel (external disturbance) on the relationship between load inputs and outputs are examined analytically. It is found that the relationship between load inputs and outputs is determined not only by the load itself (feedforward transfer function) but also by the equivalent network matrix (feedback transfer function). Thus, load identification is essentially closed-loop identification, and the impact of closed-loop identification cannot be neglected when using small disturbance data to identify load parameters. Closed-loop load model identification can be solved by the prediction error method (PEM). Implementation of PEM based on a Kalman filtering formulation is detailed. Identification results using simulated data demonstrate the correctness and significance of the theoretical analysis.

Submitted 15 May, 2019; originally announced May 2019.

Comments: 6 pages, 5 figures

Showing 1–11 of 11 results for author: Bao, W