Search | arXiv e-print repository

An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

Authors: Zijun Gao, Qi Wang, Taiyuan Mei, Xiaohan Cheng, Yun Zi, Haowei Yang

Abstract: The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the… ▽ More The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the intricate details across various image scales more effectively, thus minimizing the information loss inherent to down-sampling procedures. Additionally, to enhance the convergence rate of network training and mitigate sample imbalance issues, we have devised a modified cross-entropy loss function incorporating a balancing factor. This modification optimizes the distribution between positive and negative samples, thus improving the efficiency of model training. Experimental evaluations of our model demonstrate a substantial reduction in information loss and improved accuracy in semantic segmentation. Notably, our proposed network architecture demonstrates a substantial improvement in the finely annotated mean Intersection over Union (mIoU) on the dataset compared to the conventional SegNet. The proposed network structure not only reduces operational costs by decreasing manual inspection needs but also scales up the deployment of AI-driven image analysis across different sectors. △ Less

Submitted 26 May, 2024; originally announced June 2024.

arXiv:2306.12020 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095084

Visual-Aware Text-to-Speech

Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract: Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and s… ▽ More Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Different from traditional text-to-speech, VA-TTS highlights the impact of visual modality. On this newly-minted task, we devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on multimodal conversation dataset ViCo-X verify our proposal for generating more natural audio with scenario-appropriate rhythm and prosody. △ Less

Submitted 21 June, 2023; originally announced June 2023.

Comments: accepted as oral and top 3% paper by ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 1-5

arXiv:2203.02291 [pdf, other]

Freeform Body Motion Generation from Speech

Authors: **g Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei

Abstract: People naturally conduct spontaneous body motions to enhance their speeches while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic map** from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompos… ▽ More People naturally conduct spontaneous body motions to enhance their speeches while giving talks. Body motion generation from speech is inherently difficult due to the non-deterministic map** from speech to body motions. Most existing works map speech to motion in a deterministic way by conditioning on certain styles, leading to sub-optimal results. Motivated by studies in linguistics, we decompose the co-speech motion into two complementary parts: pose modes and rhythmic dynamics. Accordingly, we introduce a novel freeform motion generation model (FreeMo) by equip** a two-stream architecture, i.e., a pose mode branch for primary posture generation, and a rhythmic motion branch for rhythmic dynamics synthesis. On one hand, diverse pose modes are generated by conditional sampling in a latent space, guided by speech semantics. On the other hand, rhythmic dynamics are synced with the speech prosody. Extensive experiments demonstrate the superior performance against several baselines, in terms of motion diversity, quality and syncing with speech. Code and pre-trained models will be publicly available through https://github.com/TheTempAccount/Co-Speech-Motion-Generation. △ Less

Submitted 4 March, 2022; originally announced March 2022.

arXiv:2008.08528 [pdf, other]

Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification

Authors: Boqiang Xu, Lingxiao He, Xingyu Liao, Wu Liu, Zhenan Sun, Tao Mei

Abstract: Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have made great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems in low light illumination, in whi… ▽ More Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have made great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems in low light illumination, in which cases the attributes of the clothing are severely missing. We call this problem the Black Re-ID problem. To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method would focus on the head-shoulder feature by assigning a larger weight if the individual insides the image is in black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with the state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method is also proved to be effective in dealing with person Re-ID in similar clothing. Our code and dataset are avaliable on https://github.com/xbq1994/. △ Less

Submitted 19 August, 2020; originally announced August 2020.

arXiv:1908.05858 [pdf, other]

daBNN: A Super Fast Inference Framework for Binary Neural Networks on ARM devices

Authors: Jianhao Zhang, Yingwei Pan, Ting Yao, He Zhao, Tao Mei

Abstract: It is always well believed that Binary Neural Networks (BNNs) could drastically accelerate the inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has not been open-source implementation in support of this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose… ▽ More It is always well believed that Binary Neural Networks (BNNs) could drastically accelerate the inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has not been open-source implementation in support of this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose daBNN --- a super fast inference framework that implements BNNs on ARM devices. Several speed-up and memory refinement strategies for bit-packing, binarized convolution, and memory layout are uniquely devised to enhance inference efficiency. Compared to the recent open-source BNN inference framework, BMXNet, our daBNN is $7\times$$\sim$$23\times$ faster on a single binary convolution, and about $6\times$ faster on Bi-Real Net 18 (a BNN variant of ResNet-18). The daBNN is a BSD-licensed inference framework, and its source code, sample projects and pre-trained models are available on-line: https://github.com/JDAI-CV/dabnn. △ Less

Submitted 16 August, 2019; originally announced August 2019.

Comments: Accepted by 2019 ACMMM Open Source Software Competition. Source code: https://github.com/JDAI-CV/dabnn

arXiv:1802.07368 [pdf, other]

doi 10.1002/jnm.2554

Gaussian Random Number Generator: Implemented in FPGA for Quantum Key Distribution

Authors: Yue Hu, Yan Wu, Yi Chen, G. C. Wan, S. T. Mei

Abstract: Quantum Key Distribution is the process of using quantum communication to establish a shared key between two parties. It has been demonstrated the unconditional security and effective communication of quantum communication system can be guaranteed by an excellent Gaussian random number generator with high speed and an extended random period. In this paper, we propose to construct the Gaussian rand… ▽ More Quantum Key Distribution is the process of using quantum communication to establish a shared key between two parties. It has been demonstrated the unconditional security and effective communication of quantum communication system can be guaranteed by an excellent Gaussian random number generator with high speed and an extended random period. In this paper, we propose to construct the Gaussian random number generator using Field-Programmable Gate Array (FPGA) which is able to process large data in high speed. We also compare three algorithms ofGRNgeneration: Box-Muller algorithm, polarization decision algorithm, and central limit algorithm. We demonstrate that the polarization decision algorithm implemented inFPGArequires less computing resources and also produces a high-quality Gaussian random number, through the null hypothesis test. △ Less

Submitted 9 January, 2019; v1 submitted 20 February, 2018; originally announced February 2018.

Comments: 18 pages, 6 figures, accepted for publication in the International Journal of Numerical Modeling: Electronic Networks, Devices and Fields

Showing 1–6 of 6 results for author: Mei, T