-
An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation
Authors:
Zijun Gao,
Qi Wang,
Taiyuan Mei,
Xiaohan Cheng,
Yun Zi,
Haowei Yang
Abstract:
The traditional SegNet architecture commonly suffers significant information loss during the sampling process, which degrades its accuracy in image semantic segmentation tasks. To counter this, we introduce an encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve intricate details across image scales more effectively, thus minimizing the information loss inherent to down-sampling. Additionally, to speed up the convergence of network training and mitigate sample imbalance, we devise a modified cross-entropy loss function incorporating a balancing factor. This modification rebalances the contribution of positive and negative samples, improving the efficiency of model training. Experimental evaluations show a substantial reduction in information loss and improved accuracy in semantic segmentation. Notably, our proposed network architecture achieves a substantial improvement in mean Intersection over Union (mIoU) on the finely annotated dataset compared to the conventional SegNet. The proposed network structure not only reduces operational costs by decreasing manual inspection needs but also scales up the deployment of AI-driven image analysis across different sectors.
Submitted 26 May, 2024;
originally announced June 2024.
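The abstract above mentions a cross-entropy loss with a balancing factor but does not give its formula. As a minimal sketch only, assuming a binary (positive/negative) per-pixel setting, one common form weights the positive term by a factor `beta` and the negative term by `1 - beta`. The function name `balanced_bce` and the default choice of `beta` (the fraction of negative pixels) are assumptions made for illustration, not the authors' definition.

```python
import math

def balanced_bce(probs, labels, beta=None, eps=1e-7):
    """Class-balanced binary cross-entropy, averaged over pixels.

    probs  : predicted probabilities of the positive class, one per pixel
    labels : ground-truth labels (0 or 1), one per pixel
    beta   : weight on the positive term; (1 - beta) weights the negative term.
             If None, beta is set to the fraction of negative pixels, so the
             rarer class receives the larger weight (an assumed heuristic).
    """
    if not probs or len(probs) != len(labels):
        raise ValueError("probs and labels must be non-empty and equal length")
    if beta is None:
        beta = 1.0 - sum(labels) / len(labels)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= beta * y * math.log(p) + (1.0 - beta) * (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

With `beta=0.5` this reduces to half the ordinary binary cross-entropy; with one positive pixel among four, the default `beta` becomes 0.75, so a missed positive is penalized more heavily than a missed negative.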
-
Visual-Aware Text-to-Speech
Authors:
Mohan Zhou,
Yalong Bai,
Wei Zhang,
Ting Yao,
Tiejun Zhao,
Tao Mei
Abstract:
Dynamically synthesizing speech that actively responds to a listening head is critical in face-to-face interaction. For example, the speaker can use the listener's facial expression to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) from the listener in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. For this new task, we devise a baseline model that fuses phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify that our proposal generates more natural audio with scenario-appropriate rhythm and prosody.
Submitted 21 June, 2023;
originally announced June 2023.
-
Freeform Body Motion Generation from Speech
Authors:
Jing Xu,
Wei Zhang,
Yalong Bai,
Qibin Sun,
Tao Mei
Abstract:
People naturally make spontaneous body motions to enhance their speech while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture, i.e., a pose mode branch for primary posture generation and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate superior performance against several baselines in terms of motion diversity, quality, and synchronization with speech. Code and pre-trained models will be publicly available through https://github.com/TheTempAccount/Co-Speech-Motion-Generation.
Submitted 4 March, 2022;
originally announced March 2022.
-
Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification
Authors:
Boqiang Xu,
Lingxiao He,
Xingyu Liao,
Wu Liu,
Zhenan Sun,
Tao Mei
Abstract:
Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have achieved great success, most of them extract features based on clothing attributes (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems under low-light illumination, in which cases the clothing attributes are severely missing. We call this the Black Re-ID problem. To solve it, rather than relying on clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature, and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method focuses on the head-shoulder feature by assigning it a larger weight if the individual in the image is wearing black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in the training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method also proves effective for person Re-ID in similar clothing. Our code and dataset are available at https://github.com/xbq1994/.
Submitted 19 August, 2020;
originally announced August 2020.
-
daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices
Authors:
Jianhao Zhang,
Yingwei Pan,
Ting Yao,
He Zhao,
Tao Mei
Abstract:
It is widely believed that Binary Neural Networks (BNNs) could drastically improve inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has been no open-source implementation in support of this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose daBNN, a super fast inference framework that implements BNNs on ARM devices. Several speed-up and memory refinement strategies for bit-packing, binarized convolution, and memory layout are devised to enhance inference efficiency. Compared to the recent open-source BNN inference framework, BMXNet, our daBNN is $7\times \sim 23\times$ faster on a single binary convolution, and about $6\times$ faster on Bi-Real Net 18 (a BNN variant of ResNet-18). daBNN is a BSD-licensed inference framework, and its source code, sample projects and pre-trained models are available online: https://github.com/JDAI-CV/dabnn.
Submitted 16 August, 2019;
originally announced August 2019.
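To illustrate why bit-wise operations can replace floating-point arithmetic in a BNN, the sketch below packs a vector of +1/-1 values into the bits of an integer and computes a dot product with XOR and a population count. This is the general textbook technique, not daBNN's ARM kernels; the helper names `pack_signs` and `binary_dot` are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int: bit i is 1 when values[i] >= 0."""
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions where the bits agree contribute +1 and positions where they
    differ contribute -1, so dot = n - 2 * popcount(a XOR b).
    """
    mask = (1 << n) - 1  # keep only the n valid bit positions
    mismatches = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * mismatches

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot = binary_dot(pack_signs(a), pack_signs(b), len(a))
```

Here `dot` equals the ordinary dot product of `a` and `b`. A binarized convolution applies the same identity to each window of the input, which is where hardware XOR and popcount instructions give the speed-up.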
-
Gaussian Random Number Generator: Implemented in FPGA for Quantum Key Distribution
Authors:
Yue Hu,
Yan Wu,
Yi Chen,
G. C. Wan,
S. T. Mei
Abstract:
Quantum Key Distribution is the process of using quantum communication to establish a shared key between two parties. It has been demonstrated that the unconditional security and effective communication of a quantum communication system can be guaranteed by an excellent Gaussian random number generator with high speed and an extended random period. In this paper, we propose to construct the Gaussian random number generator using a Field-Programmable Gate Array (FPGA), which is able to process large amounts of data at high speed. We also compare three algorithms of GRN generation: the Box-Muller algorithm, the polarization decision algorithm, and the central limit algorithm. We demonstrate, through a null hypothesis test, that the polarization decision algorithm implemented in FPGA requires fewer computing resources and also produces high-quality Gaussian random numbers.
Submitted 9 January, 2019; v1 submitted 20 February, 2018;
originally announced February 2018.
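Two of the algorithms named in the abstract above have well-known software forms, sketched here in plain Python for orientation. This is not the paper's FPGA design, and the polarization decision algorithm is omitted because the abstract does not define it. The function names and the choice of 12 uniform samples for the central limit variant are assumptions.

```python
import math
import random

def box_muller(rng):
    """Return two independent standard normal samples from two uniform samples."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

def central_limit(rng, k=12):
    """Approximate a standard normal sample by summing k uniform samples.

    The sum of k uniforms on [0, 1) has mean k/2 and variance k/12, so it is
    centered and rescaled to zero mean and unit variance.
    """
    total = sum(rng.random() for _ in range(k))
    return (total - k / 2.0) / math.sqrt(k / 12.0)

rng = random.Random(0)  # fixed seed for reproducibility
z0, z1 = box_muller(rng)
z2 = central_limit(rng)
```

Box-Muller is exact but needs a logarithm, a square root and trigonometric functions, whereas the central limit variant uses only additions and yields an approximation with truncated tails; trade-offs of this kind are what a hardware comparison of such algorithms would weigh.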