Search | arXiv e-print repository

On Efficient Neural Network Architectures for Image Compression

Authors: Yichi Zhang, Zhihao Duan, Fengqing Zhu

Abstract: Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential directions for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 2024 IEEE International Conference on Image Processing (ICIP2024)

arXiv:2405.07291 [pdf, other]

Robust Beamforming with Gradient-based Liquid Neural Network

Authors: Xinquan Wang, Fenghao Zhu, Chongwen Huang, Ahmed Alhammadi, Faouzi Bader, Zhaoyang Zhang, Chau Yuen, Merouane Debbah

Abstract: Millimeter-wave (mmWave) multiple-input multiple-output (MIMO) communication with advanced beamforming technologies is a key enabler to meet the growing demands of future mobile communication. However, the dynamic nature of cellular channels in large-scale urban mmWave MIMO communication scenarios brings substantial challenges, particularly in terms of complexity and robustness. To address these issues, we propose a robust gradient-based liquid neural network (GLNN) framework that utilizes ordinary differential equation-based liquid neurons to solve the beamforming problem. Specifically, our proposed GLNN framework takes gradients of the optimization objective function as inputs to extract the high-order channel feature information, and then introduces a residual connection to mitigate the training burden. Furthermore, we use the manifold learning technique to compress the search space of the beamforming problem. These designs enable the GLNN to effectively maintain low complexity while ensuring strong robustness to noisy and highly dynamic channels. Extensive simulation results demonstrate that the GLNN can achieve 4.15% higher spectral efficiency than typical iterative algorithms, and reduce the time consumption to only 1.61% of that of conventional methods.

Submitted 17 May, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

arXiv:2405.00391 [pdf, ps, other]

Beamforming Inferring by Conditional WGAN-GP for Holographic Antenna Arrays

Authors: Fenghao Zhu, Xinquan Wang, Chongwen Huang, Ahmed Alhammadi, Hui Chen, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: The beamforming technology with large holographic antenna arrays is one of the key enablers for the next generation of wireless systems, which can significantly improve the spectral efficiency. However, the deployment of large antenna arrays implies high algorithm complexity and resource overhead at both receiver and transmitter ends. To address this issue, advanced technologies such as artificial intelligence have been developed to reduce beamforming overhead. Intuitively, if we can implement near-optimal beamforming using only a tiny subset of all the channel information, the overhead for channel estimation and beamforming would be reduced significantly compared with the traditional beamforming methods, which usually need full channel information and the inversion of large-dimensional matrices. In light of this idea, we propose a novel scheme that utilizes a Wasserstein generative adversarial network with gradient penalty to infer the full beamforming matrices based on very little channel information. Simulation results confirm that it can achieve performance comparable to the weighted minimum mean-square error algorithm, while reducing the overhead by over 50%.

Submitted 15 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

arXiv:2405.00365 [pdf, other]

Robust Continuous-Time Beam Tracking with Liquid Neural Network

Authors: Fenghao Zhu, Xinquan Wang, Chongwen Huang, Richeng Jin, Qianqian Yang, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: Millimeter-wave (mmWave) technology is increasingly recognized as a pivotal technology of the sixth-generation communication networks due to the large amounts of available spectrum at high frequencies. However, the huge overhead associated with beam training imposes a significant challenge in mmWave communications, particularly in urban environments with high background noise. To reduce this high overhead, we propose a novel solution for robust continuous-time beam tracking with a liquid neural network, which dynamically adjusts the narrow mmWave beams to ensure real-time beam alignment with mobile users. Through extensive simulations, we validate the effectiveness of our proposed method and demonstrate its superiority over existing state-of-the-art deep-learning-based approaches. Specifically, our scheme achieves up to 46.9% higher normalized spectral efficiency than the baselines when the user is moving at 5 m/s, demonstrating the potential of liquid neural networks to enhance mmWave mobile communication performance.

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.15575 [pdf, other]

Jitter Characterization of the HyTI Satellite

Authors: Chase Urasaki, Frances Zhu, Michael Bottom, Miguel Nunes, Aidan Walk

Abstract: The Hyperspectral Thermal Imager (HyTI) is a technology demonstration mission that will obtain high spatial, spectral, and temporal resolution long-wave infrared images of Earth's surface from a 6U cubesat. HyTI science requires that the pointing accuracy of the optical axis shall not exceed 2.89 arcsec over the 0.5 ms integration time due to microvibration effects (known as jitter). Two sources of vibration are a cryocooler that is added to maintain the detector at 68 K and three orthogonally placed reaction wheels that are a part of the attitude control system. Both of these parts will introduce vibrations that are propagated through the satellite structure while imaging. Typical methods of characterizing and measuring jitter involve complex finite element methods and specialized equipment and setups. In this paper, we describe a novel method of characterizing jitter for small satellite systems that is low-cost and minimally modifies the subject's mass distribution. The metrology instrument is comprised of a laser source, a small mirror mounted via a 3D printed clamp to a jig, and a lateral effect position-sensing detector. The position-sensing detector samples at 1000 Hz and can measure displacements as little as 0.15 arcsec at distances of one meter. This paper provides an experimental procedure that incrementally analyzes vibratory sources to establish causal relationships between sources and the vibratory modes they create. We demonstrate the capabilities of this metrology system and testing procedure on HyTI in the Hawaii Space Flight Lab's clean room. Results include power spectral density plots that show fundamental and higher-order vibratory modal frequencies. Results from metrology show that jitter from reaction wheels meets HyTI system requirements within 3σ.

Submitted 23 April, 2024; originally announced April 2024.

Comments: Accepted for the 2024 IEEE Aerospace Conference Proceedings

arXiv:2404.12257 [pdf, other]

Food Portion Estimation via 3D Object Scaling

Authors: Gautham Vinod, Jiangpeng He, Zeman Shao, Fengqing Zhu

Abstract: Image-based methods to analyze food images have alleviated the user burden and biases associated with traditional methods. However, accurate portion estimation remains a major challenge due to the loss of 3D information in the 2D representation of foods captured by smartphone cameras or wearable devices. In this paper, we propose a new framework to estimate both food volume and energy from 2D images by leveraging the power of 3D food models and a physical reference in the eating scene. Our method estimates the pose of the camera and the food object in the input image and recreates the eating occasion by rendering an image of a 3D model of the food with the estimated poses. We also introduce a new dataset, SimpleFood45, which contains 2D images of 45 food items and associated annotations including food volume, weight, and energy. Our method achieves an average error of 31.10 kcal (17.67%) on this dataset, outperforming existing portion estimation methods.

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.07507 [pdf, other]

Learning to Classify New Foods Incrementally Via Compressed Exemplars

Authors: Justin Yang, Zhihao Duan, Jiangpeng He, Fengqing Zhu

Abstract: Food image classification systems play a crucial role in health monitoring and diet tracking through image-based dietary assessment techniques. However, existing food recognition systems rely on static datasets characterized by a pre-defined fixed number of food classes. This contrasts drastically with the reality of food consumption, which features constantly changing data. Therefore, food image classification systems should adapt to and manage data that continuously evolves. This is where continual learning plays an important role. A challenge in continual learning is catastrophic forgetting, where ML models tend to discard old knowledge upon learning new information. While memory-replay algorithms have shown promise in mitigating this problem by storing old data as exemplars, they are hampered by the limited capacity of memory buffers, leading to an imbalance between new and previously learned data. To address this, our work explores the use of neural image compression to extend buffer size and enhance data diversity. We introduce the concept of continuously learning a neural compression model to adaptively improve the quality of compressed data and optimize the bits per pixel (bpp) to store more exemplars. Our extensive experiments, with evaluations on the food-specific datasets Food-101 and VFN-74 as well as the general dataset ImageNet-100, demonstrate improvements in classification accuracy. This progress is pivotal in advancing more realistic food recognition systems that are capable of adapting to continually evolving data. Moreover, the principles and methodologies we've developed hold promise for broader applications, extending their benefits to other domains of continual machine learning systems.

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.00432 [pdf, other]

doi 10.1109/ICMEW59549.2023.00038

Flexible Variable-Rate Image Feature Compression for Edge-Cloud Systems

Authors: Md Adnan Faisal Hossain, Zhihao Duan, Yuning Huang, Fengqing Zhu

Abstract: Feature compression is a promising direction for coding for machines. Existing methods have made substantial progress, but they require designing and training separate neural network models to meet different specifications of compression rate, performance accuracy and computational complexity. In this paper, a flexible variable-rate feature compression method is presented that can operate on a range of rates by introducing a rate control parameter as an input to the neural network model. By compressing different intermediate features of a pre-trained vision task model, the proposed method can scale the encoding complexity without changing the overall size of the model. The proposed method is more flexible than existing baselines, at the same time outperforming them in terms of the three-way trade-off between feature compression rate, vision task accuracy, and encoding complexity. We have made the source code available at https://github.com/adnan-hossain/var_feat_comp.git.

Submitted 30 March, 2024; originally announced April 2024.

Comments: 6 pages, 7 figures, 1 table, International Conference on Multimedia and Expo Workshops 2023

arXiv:2403.18535 [pdf, other]

Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs

Authors: Yichi Zhang, Zhihao Duan, Yuning Huang, Fengqing Zhu

Abstract: Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 2024 IEEE International Conference on Multimedia and Expo (ICME2024)

arXiv:2402.18862 [pdf, other]

Towards Backward-Compatible Continual Learning of Image Compression

Authors: Zhihao Duan, Ming Lu, Justin Yang, Jiangpeng He, Zhan Ma, Fengqing Zhu

Abstract: This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g., adapting to new data or target bitrates) without breaking backward compatibility, the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions, such as end-to-end fine-tuning, do not preserve the desired backward compatibility. To tackle this, we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. Our code is available at https://gitlab.com/viper-purdue/continual-compression

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.10626 [pdf, other]

Robust Beamforming for RIS-aided Communications: Gradient-based Manifold Meta Learning

Authors: Fenghao Zhu, Xinquan Wang, Chongwen Huang, Zhaohui Yang, Xiaoming Chen, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: Reconfigurable intelligent surface (RIS) has become a promising technology to realize the programmable wireless environment via steering the incident signal in fully customizable ways. However, a major challenge in RIS-aided communication systems is the simultaneous design of the precoding matrix at the base station (BS) and the phase shifting matrix of the RIS elements. This is mainly attributed to the highly non-convex optimization space of variables at both the BS and the RIS, and the diversity of communication environments. Generally, traditional optimization methods for this problem suffer from high complexity, while existing deep learning based methods lack robustness in various scenarios. To address these issues, we introduce a gradient-based manifold meta learning method (GMML), which works without pre-training and has strong robustness for RIS-aided communications. Specifically, the proposed method fuses meta learning and manifold learning to improve the overall spectral efficiency and reduce the overhead of high-dimensional signal processing. Unlike traditional deep learning based methods which directly take channel state information as input, GMML feeds the gradients of the precoding matrix and phase shifting matrix into neural networks. Coherently, we design a differential regulator to constrain the phase shifting matrix of the RIS. Numerical results show that the proposed GMML can improve the spectral efficiency by up to 7.31% and speed up convergence by a factor of 23 compared to traditional approaches. Moreover, the results also demonstrate remarkable robustness and adaptability in dynamic settings.

Submitted 16 February, 2024; originally announced February 2024.

Comments: journal

arXiv:2402.02349 [pdf]

Vision Transformer-based Multimodal Feature Fusion Network for Lymphoma Segmentation on PET/CT Images

Authors: Huan Huang, Liheng Qiu, Shenmiao Yang, Longxi Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Chen Zhao, Weihua Zhou

Abstract: Background: Diffuse large B-cell lymphoma (DLBCL) segmentation is a challenge in medical image analysis. Traditional segmentation methods for lymphoma struggle with the complex patterns and the presence of DLBCL lesions. Objective: We aim to develop an accurate method for lymphoma segmentation with 18F-Fluorodeoxyglucose positron emission tomography (PET) and computed tomography (CT) images. Methods: Our lymphoma segmentation approach combines a vision transformer with dual encoders, adeptly fusing PET and CT data via a multimodal cross-attention fusion (MMCAF) module. In this study, PET and CT data from 165 DLBCL patients were analyzed. A 5-fold cross-validation was employed to evaluate the performance and generalization ability of our method. Ground truths were annotated by experienced nuclear medicine experts. We calculated the total metabolic tumor volume (TMTV) and performed a statistical analysis on our results. Results: The proposed method exhibited accurate performance in DLBCL lesion segmentation, achieving a Dice similarity coefficient of 0.9173±0.0071, a Hausdorff distance of 2.71±0.25 mm, a sensitivity of 0.9462±0.0223, and a specificity of 0.9986±0.0008. Additionally, a Pearson correlation coefficient of 0.9030±0.0179 and an R-square of 0.8586±0.0173 were observed in TMTV when measured on manual annotation compared to our segmentation results. Conclusion: This study highlights the advantages of MMCAF and vision transformer for lymphoma segmentation using PET and CT, offering great promise for computer-aided lymphoma diagnosis and treatment.

Submitted 4 February, 2024; originally announced February 2024.

Comments: 14 pages, 6 figures; reference added

arXiv:2401.11615 [pdf, other]

Another Way to the Top: Exploit Contextual Clustering in Learned Image Coding

Authors: Yichi Zhang, Zhihao Duan, Ming Lu, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: While convolution and self-attention are extensively used in learned image compression (LIC) for transform coding, this paper proposes an alternative called Contextual Clustering based LIC (CLIC) which primarily relies on clustering operations and local attention for correlation characterization and compact representation of an image. As seen, CLIC expands the receptive field into the entire image for intra-cluster feature aggregation. Afterward, features are reordered to their original spatial positions to pass through the local attention units for inter-cluster embedding. Additionally, we introduce the Guided Post-Quantization Filtering (GuidedPQF) into CLIC, effectively mitigating the propagation and accumulation of quantization errors at the initial decoding stage. Extensive experiments demonstrate the superior performance of CLIC over state-of-the-art works: when optimized using MSE, it outperforms VVC by about 10% BD-Rate in three widely-used benchmark datasets; when optimized using MS-SSIM, it saves more than 50% BD-Rate over VVC. Our CLIC offers a new way to generate compact representations for image compression, which also provides a novel direction along the line of LIC development.

Submitted 21 January, 2024; originally announced January 2024.

Comments: The 38th Annual AAAI Conference on Artificial Intelligence (AAAI 2024)

arXiv:2312.07126 [pdf, other]

Deep Hierarchical Video Compression

Authors: Ming Lu, Zhihao Duan, Fengqing Zhu, Zhan Ma

Abstract: Recently, probabilistic predictive coding that directly models the conditional distribution of latent features across successive frames for temporal redundancy removal has yielded promising results. Existing methods using a single-scale Variational AutoEncoder (VAE) must devise complex networks for conditional probability estimation in latent space, neglecting multiscale characteristics of video frames. Instead, this work proposes hierarchical probabilistic predictive coding, for which hierarchical VAEs are carefully designed to characterize multiscale latent features as a family of flexible priors and posteriors to predict the probabilities of future frames. Under such a hierarchical structure, lightweight networks are sufficient for prediction. The proposed method outperforms representative learned video compression models on common testing videos and demonstrates computational friendliness with a much smaller memory footprint and faster encoding/decoding. Extensive experiments on adaptation to temporal patterns also indicate the better generalization of our hierarchical predictive mechanism. Furthermore, our solution is the first to enable progressive decoding that is favored in networked video applications with packet loss.

Submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.06861 [pdf, other]

Energy-efficient Beamforming for RISs-aided Communications: Gradient Based Meta Learning

Authors: Xinquan Wang, Fenghao Zhu, Qianyun Zhou, Qihao Yu, Chongwen Huang, Ahmed Alhammadi, Zhaoyang Zhang, Chau Yuen, Mérouane Debbah

Abstract: Reconfigurable intelligent surfaces (RISs) have become a promising technology to meet the requirements of energy efficiency and scalability in future sixth-generation (6G) communications. However, a significant challenge in RISs-aided communications is the joint optimization of active and passive beamforming at base stations (BSs) and RISs, respectively. Specifically, the main difficulty is attributed to the highly non-convex optimization space of beamforming matrices at both BSs and RISs, as well as the diversity and mobility of communication scenarios. To address this, we present a green gradient-based meta learning beamforming (GMLB) approach. Unlike traditional deep learning based methods which take channel information directly as input, GMLB feeds the gradient of the sum rate into neural networks. Coherently, we design a differential regulator to address the phase shift optimization of RISs. Moreover, we use meta learning to iteratively optimize the beamforming matrices of BSs and RISs. These techniques allow the proposed method to work well without requiring energy-consuming pre-training. Simulations show that GMLB can achieve a higher sum rate than typical alternating optimization algorithms, with energy consumption lower by two orders of magnitude.

Submitted 16 February, 2024; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: 5 pages, 8 figures. Accepted in IEEE ICC 2024 (GCSN symposium)

arXiv:2311.00567 [pdf]

A Robust Deep Learning Method with Uncertainty Estimation for the Pathological Classification of Renal Cell Carcinoma based on CT Images

Authors: Ni Yao, Hang Hu, Kaicong Chen, Chen Zhao, Yuan Guo, Boya Li, Jiaofen Nan, Yanting Li, Chuang Han, Fubao Zhu, Weihua Zhou, Li Tian

Abstract: Objectives: To develop and validate a deep learning-based diagnostic model incorporating uncertainty estimation so as to facilitate radiologists in the preoperative differentiation of the pathological subtypes of renal cell carcinoma (RCC) based on CT images. Methods: Data from 668 consecutive patients with pathologically proven RCC were retrospectively collected from Center 1. By using five-fold cross-validation, a deep learning model incorporating uncertainty estimation was developed to classify RCC subtypes into clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC). An external validation set of 78 patients from Center 2 further evaluated the model's performance. Results: In the five-fold cross-validation, the model's area under the receiver operating characteristic curve (AUC) for the classification of ccRCC, pRCC, and chRCC was 0.868 (95% CI: 0.826-0.923), 0.846 (95% CI: 0.812-0.886), and 0.839 (95% CI: 0.802-0.88), respectively. In the external validation set, the AUCs were 0.856 (95% CI: 0.838-0.882), 0.787 (95% CI: 0.757-0.818), and 0.793 (95% CI: 0.758-0.831) for ccRCC, pRCC, and chRCC, respectively. Conclusions: The developed deep learning model demonstrated robust performance in predicting the pathological subtypes of RCC, while the incorporated uncertainty emphasized the importance of understanding model confidence, which is crucial for assisting clinical decision-making for patients with renal tumors. Clinical relevance statement: Our deep learning approach, integrated with uncertainty estimation, offers clinicians a dual advantage: accurate RCC subtype predictions complemented by diagnostic confidence references, promoting informed decision-making for patients with RCC.

Submitted 12 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

Comments: 16 pages, 6 figures

arXiv:2309.05423 [pdf, other]

Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Authors: **zuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, **g Guo, Benlai Tang, Fengjie Zhu

Abstract: In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while remaining remarkably robust to data scarcity.

Submitted 11 June, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2309.02574 [pdf, other]

An Improved Upper Bound on the Rate-Distortion Function of Images

Authors: Zhihao Duan, Jack Ma, Jiangpeng He, Fengqing Zhu

Abstract: Recent work has shown that Variational Autoencoders (VAEs) can be used to upper-bound the information rate-distortion (R-D) function of images, i.e., the fundamental limit of lossy image compression. In this paper, we report an improved upper bound on the R-D function of images implemented by (1) introducing a new VAE model architecture, (2) applying variable-rate compression techniques, and (3) proposing a novel \ourfunction{} to stabilize training. We demonstrate that at least 30% BD-rate reduction w.r.t. the intra prediction mode in VVC codec is achievable, suggesting that there is still great potential for improving lossy image compression. Code is made publicly available at https://github.com/duanzhiihao/lossy-vae.

Submitted 5 September, 2023; originally announced September 2023.

Comments: Conference paper at ICIP 2023. The first two authors share equal contributions

arXiv:2307.13241 [pdf, other]

A Visual Quality Assessment Method for Raster Images in Scanned Document

Authors: Justin Yang, Peter Bauer, Todd Harris, Changhyung Lee, Hyeon Seok Seo, Jan P Allebach, Fengqing Zhu

Abstract: Image quality assessment (IQA) is an active research area in the field of image processing. Most prior works focus on visual quality of natural images captured by cameras. In this paper, we explore visual quality of scanned documents, focusing on raster image areas. Different from many existing works which aim to estimate a visual quality score, we propose a machine learning based classification method to determine whether the visual quality of a scanned raster image at a given resolution setting is acceptable. We conduct a psychophysical study to determine the acceptability at different image resolutions based on human subject ratings and use them as the ground truth to train our machine learning model. However, this dataset is unbalanced as most images were rated as visually acceptable. To address the data imbalance problem, we introduce several noise models to simulate the degradation of image quality during the scanning process. Our results show that by including augmented data in training, we can significantly improve the performance of the classifier to determine whether the visual quality of raster images in a scanned document is acceptable or not for a given resolution setting.

Submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.12263 [pdf, other]

Efficient Gaussian Process Classification-based Physical-Layer Authentication with Configurable Fingerprints for 6G-Enabled IoT

Authors: Rui Meng, Fangzhou Zhu, Xiaodong Xu, Liang **, Bizhu Wang, Bingxuan Xu, Han Meng, ** Zhang

Abstract: Physical-Layer Authentication (PLA) has recently been regarded as an endogenous-secure and energy-efficient technique to recognize IoT terminals. However, the major challenge of applying the state-of-the-art PLA schemes directly to 6G-enabled IoT is the inaccurate channel fingerprint estimation in low Signal-to-Noise Ratio (SNR) environments, which will greatly influence the reliability and robustness of PLA. To tackle this issue, we propose a configurable-fingerprint-based PLA architecture through Intelligent Reflecting Surface (IRS) that helps create an alternative wireless transmission path to provide more accurate fingerprints. According to Bayes' theorem, we propose a Gaussian Process Classification (GPC)-based PLA scheme, which utilizes the Expectation Propagation (EP) method to obtain the identities of unknown fingerprints. Considering that obtaining sufficient labeled fingerprint samples to train the GPC-based authentication model is challenging for future 6G systems, we further extend the GPC-based PLA to the Efficient-GPC (EGPC)-based PLA through active learning, which requires fewer labeled fingerprints and is more feasible. We also propose three fingerprint selecting algorithms to choose fingerprints, whose identities are queried to the upper-layer authentication mechanisms. For this reason, the proposed EGPC-based scheme is also a lightweight cross-layer authentication method to offer a superior security level.
The simulations conducted on synthetic datasets demonstrate that the IRS-assisted scheme reduces the authentication error rate by 98.69% compared to the non-IRS-based scheme. Additionally, the proposed fingerprint selection algorithms reduce the authentication error rate by 65.96% to 86.93% and 45.45% to 70.00% under perfect and imperfect channel estimation conditions, respectively, when compared with baseline algorithms.

Submitted 23 July, 2023; originally announced July 2023.

Comments: 12 pages, 9 figures

arXiv:2306.17008 [pdf]

MLA-BIN: Model-level Attention and Batch-instance Style Normalization for Domain Generalization of Federated Learning on Medical Image Segmentation

Authors: Fubao Zhu, Yanhui Tian, Chuang Han, Yanting Li, Jiaofen Nan, Ni Yao, Weihua Zhou

Abstract: The privacy protection mechanism of federated learning (FL) offers an effective solution for cross-center medical collaboration and data sharing. In multi-site medical image segmentation, each medical site serves as a client of FL, and its data naturally forms a domain. FL makes it possible to improve model performance on seen domains. However, there is a problem of domain generalization (DG) in actual deployment: the performance of the model trained by FL decreases in unseen domains. Hence, MLA-BIN is proposed to solve the DG of FL in this study. Specifically, the model-level attention module (MLA) and batch-instance style normalization (BIN) block were designed. The MLA represents the unseen domain as a linear combination of seen domain models. The attention mechanism is introduced for the weighting coefficient to obtain the optimal coefficient according to the similarity of inter-domain data features. MLA enables the global model to generalize to unseen domains. In the BIN block, batch normalization (BN) and instance normalization (IN) are combined in the shallow layers of the segmentation network to perform style normalization, solving the influence of inter-domain image style differences on DG. The extensive experimental results of two medical image segmentation tasks demonstrate that the proposed MLA-BIN outperforms state-of-the-art methods.

Submitted 29 June, 2023; originally announced June 2023.

Comments: 9 pages, 8 figures, 2 tables

arXiv:2306.15212 [pdf, other]

TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection

Authors: Jie Liu, Zhiba Su, Hui Huang, Caiyan Wan, Quanxiu Wang, Jiangli Hong, Benlai Tang, Fengjie Zhu

Abstract: Thanks to recent advancements in end-to-end speech modeling technology, it has become increasingly feasible to imitate and clone a user's voice. This leads to a significant challenge in differentiating between authentic and fabricated audio segments. To address the issue of user voice abuse and misuse, the second Audio Deepfake Detection Challenge (ADD 2023) aims to detect and analyze deepfake speech utterances. Specifically, Track 2, named the Manipulation Region Location (RL), aims to pinpoint the location of manipulated regions in audio, which can be present in both real and generated audio segments. We propose our novel TranssionADD system as a solution to the challenging problem of model robustness and audio segment outliers in the trace competition. Our system provides three unique contributions: 1) we adapt the sequence tagging task for audio deepfake detection; 2) we improve model generalization by various data augmentation techniques; 3) we incorporate a multi-frame detection (MFD) module to overcome the limited representation provided by a single frame and use an isolated-frame penalty (IFP) loss to handle outliers in segments. Our best submission achieved 2nd place in Track 2, demonstrating the effectiveness and robustness of our proposed system.

Submitted 27 June, 2023; originally announced June 2023.

arXiv:2303.09046 [pdf, other]

Self-Supervised Visual Representation Learning on Food Images

Authors: Andrew Peng, Jiangpeng He, Fengqing Zhu

Abstract: Food image analysis is the groundwork for image-based dietary assessment, which is the process of monitoring what kinds of food and how much energy is consumed using captured food or eating scene images. Existing deep learning-based methods learn the visual representation for downstream tasks based on human annotation of each food image. However, most food images in real life are obtained without labels, and data annotation requires plenty of time and human effort, which is not feasible for real-world applications. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning of visual representations directly from unlabeled data. However, none of these existing works focus on food images, which are more challenging than general objects due to their high inter-class similarity and intra-class variance. In this paper, we focus on the implementation and analysis of existing representative self-supervised learning methods on food images. Specifically, we first compare the performance of six selected self-supervised learning models on the Food-101 dataset. Then we analyze the pros and cons of each selected model when training on food data to identify the key factors that can help improve the performance. Finally, we propose several ideas for future work on self-supervised visual representation learning for food images.

Submitted 15 March, 2023; originally announced March 2023.

Comments: Presented and published in EI 2023 Conference Proceedings

arXiv:2303.08156 [pdf, other]

Nonlinear Hyperspectral Unmixing based on Multilinear Mixing Model using Convolutional Autoencoders

Authors: Tingting Fang, Fei Zhu, Jie Chen

Abstract: Unsupervised spectral unmixing consists of representing each observed pixel as a combination of several pure materials called endmembers with their corresponding abundance fractions. Beyond the linear assumption, various nonlinear unmixing models have been proposed, with the associated optimization problems solved either by traditional optimization algorithms or deep learning techniques. Current deep learning-based nonlinear unmixing focuses on the models in additive, bilinear-based formulations. By interpreting the reflection process using the discrete Markov chain, the multilinear mixing model (MLM) successfully accounts for the up to infinite-order interactions between endmembers. However, explicitly simulating the physical process of MLM with neural networks is a challenging problem that has not been addressed so far. In this article, we propose a novel autoencoder-based network for unsupervised unmixing based on MLM. Benefitting from an elaborate network design, the relationships among all the model parameters, i.e., endmembers, abundances, and transition probability parameters, are explicitly modeled. There are two modes: MLM-1DAE considers only pixel-wise spectral information, and MLM-3DAE exploits the spectral-spatial correlations within input patches. Experiments on both the synthetic and real datasets demonstrate the effectiveness of the proposed method as it achieves competitive performance to the classic solutions of MLM.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2302.08899 [pdf, other]

doi 10.1109/TPAMI.2023.3322904

QARV: Quantization-Aware ResNet VAE for Lossy Image Compression

Authors: Zhihao Duan, Ming Lu, Jack Ma, Yuning Huang, Zhan Ma, Fengqing Zhu

Abstract: This paper addresses the problem of lossy image compression, a fundamental problem in image processing and information theory that is involved in many real-world applications. We start by reviewing the framework of variational autoencoders (VAEs), a powerful class of generative probabilistic models that has a deep connection to lossy compression. Based on VAEs, we develop a novel scheme for lossy image compression, which we name quantization-aware ResNet VAE (QARV). Our method incorporates a hierarchical VAE architecture integrated with test-time quantization and quantization-aware training, without which efficient entropy coding would not be possible. In addition, we design the neural network architecture of QARV specifically for fast decoding and propose an adaptive normalization operation for variable-rate compression. Extensive experiments are conducted, and results show that QARV achieves variable-rate compression, high-speed decoding, and a better rate-distortion performance than existing baseline methods. The code of our method is publicly accessible at https://github.com/duanzhiihao/lossy-vae

Submitted 1 December, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Full version (19 pages, includes appendix) of the paper accepted by IEEE TPAMI

arXiv:2301.12340 [pdf]

Incremental Value and Interpretability of Radiomics Features of Both Lung and Epicardial Adipose Tissue for Detecting the Severity of COVID-19 Infection

Authors: Ni Yao, Yanhui Tian, Daniel Gama das Neves, Chen Zhao, Claudio Tinoco Mesquita, Wolney de Andrade Martins, Alair Augusto Sarmet Moreira Damas dos Santos, Yanting Li, Chuang Han, Fubao Zhu, Neng Dai, Weihua Zhou

Abstract: Epicardial adipose tissue (EAT) is known for its pro-inflammatory properties and association with Coronavirus Disease 2019 (COVID-19) severity. However, current EAT segmentation methods do not consider positional information. Additionally, the detection of COVID-19 severity lacks consideration for EAT radiomics features, which limits interpretability. This study investigates the use of radiomics features from EAT and lungs to detect the severity of COVID-19 infections. A retrospective analysis of 515 patients with COVID-19 (Cohort1: 415, Cohort2: 100) was conducted using a proposed three-stage deep learning approach for EAT extraction. Lung segmentation was achieved using a published method. A hybrid model for detecting the severity of COVID-19 was built in a derivation cohort, and its performance and uncertainty were evaluated in internal (125, Cohort1) and external (100, Cohort2) validation cohorts. For EAT extraction, the Dice similarity coefficients (DSC) of the two centers were 0.972 (±0.011) and 0.968 (±0.005), respectively. For severity detection, the hybrid model with radiomics features of both lungs and EAT showed improvements in AUC, net reclassification improvement (NRI), and integrated discrimination improvement (IDI) compared to the model with only lung radiomics features. The hybrid model exhibited an increase of 0.1 (p<0.001), 19.3%, and 18.0% respectively, in the internal validation cohort and an increase of 0.09 (p<0.001), 18.0%, and 18.0%, respectively, in the external validation cohort while outperforming existing detection methods.
Uncertainty quantification and radiomics features analysis confirmed the interpretability of case prediction after inclusion of EAT features.

Submitted 6 December, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

Comments: 20 pages, 7 figures

arXiv:2211.09897 [pdf, other]

Efficient Feature Compression for Edge-Cloud Systems

Authors: Zhihao Duan, Fengqing Zhu

Abstract: Optimizing computation in an edge-cloud system is an important yet challenging problem. In this paper, we consider a three-way trade-off between bit rate, classification accuracy, and encoding complexity in an edge-cloud image classification system. Our method includes a new training strategy and an efficient encoder architecture to improve the rate-accuracy performance. Our design can also be easily scaled according to different computation resources on the edge device, taking a step towards achieving a rate-accuracy-complexity (RAC) trade-off. Under various settings, our feature coding system consistently outperforms previous methods in terms of the RAC performance.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Picture Coding Symposium (PCS) 2022

arXiv:2210.05644 [pdf, other]

Simulating single-photon detector array sensors for depth imaging

Authors: Stirling Scholes, Germán Mora-Martín, Feng Zhu, Istvan Gyongy, Phil Soan, Jonathan Leach

Abstract: Single-Photon Avalanche Detector (SPAD) arrays are a rapidly emerging technology. These multi-pixel sensors have single-photon sensitivities and picosecond temporal resolutions; thus, they can rapidly generate depth images with millimeter precision. Such sensors are a key enabling technology for future autonomous systems as they provide guidance and situational awareness. However, to fully exploit the capabilities of SPAD array sensors, it is crucial to establish the quality of depth images they are able to generate in a wide range of scenarios. Given a particular optical system and a finite image acquisition time, what is the best-case depth resolution and what are realistic images generated by SPAD arrays? In this work, we establish a robust yet simple numerical procedure that rapidly establishes the fundamental limits to depth imaging with SPAD arrays under real world conditions. Our approach accurately generates realistic depth images in a wide range of scenarios, allowing the performance of an optical depth imaging system to be established without the need for costly and laborious field testing. This procedure has applications in object detection and tracking for autonomous systems and could be easily extended to systems for underwater imaging or for imaging around corners.

Submitted 7 October, 2022; originally announced October 2022.

arXiv:2208.13056 [pdf, other]

doi 10.1109/WACV56688.2023.00028

Lossy Image Compression with Quantized Hierarchical VAEs

Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

Abstract: Recent research has shown a strong theoretical connection between variational autoencoders (VAEs) and the rate-distortion theory. Motivated by this, we consider the problem of lossy image compression from the perspective of generative modeling. Starting with ResNet VAEs, which are originally designed for data (image) distribution modeling, we redesign their latent variable model using a quantization-aware posterior and prior, enabling easy quantization and entropy coding at test time. Along with improved neural network architecture, we present a powerful and efficient model that outperforms previous methods on natural image lossy compression. Our model compresses images in a coarse-to-fine fashion and supports parallel encoding and decoding, leading to fast execution on GPUs. Code is available at https://github.com/duanzhiihao/lossy-vae.

Submitted 25 March, 2023; v1 submitted 27 August, 2022; originally announced August 2022.

Comments: WACV 2023 Best Algorithms Paper Award, revised version

arXiv:2208.03752 [pdf]

doi 10.1007/s12350-023-03226-2

Automatic reorientation by deep learning to generate short axis SPECT myocardial perfusion images

Authors: Fubao Zhu, Guojie Wang, Chen Zhao, Saurabh Malhotra, Min Zhao, Zhuo He, Jianzhou Shi, Zhixin Jiang, Weihua Zhou

Abstract: Single photon emission computed tomography (SPECT) myocardial perfusion images (MPI) can be displayed both in traditional short-axis (SA) cardiac planes and polar maps for interpretation and quantification. It is essential to reorient the reconstructed transaxial SPECT MPI into standard SA slices. This study aims to develop a deep-learning-based approach for automatic reorientation of MPI. Methods: A total of 254 patients were enrolled, including 228 stress SPECT MPIs and 248 rest SPECT MPIs. Five-fold cross-validation with 180 stress and 201 rest MPIs was used for training and internal validation; the remaining images were used for testing. The rigid transformation parameters (translation and rotation) from manual reorientation were annotated by an experienced operator and used as the ground truth. A convolutional neural network (CNN) was designed to predict the transformation parameters. Then, the derived transform was applied to the grid generator and sampler in a spatial transformer network (STN) to generate the reoriented image. A loss function containing mean absolute errors for translation and mean square errors for rotation was employed. A three-stage optimization strategy was adopted for model optimization: 1) optimize the translation parameters while fixing the rotation parameters; 2) optimize the rotation parameters while fixing the translation parameters; 3) optimize both translation and rotation parameters together.

Submitted 7 August, 2022; originally announced August 2022.

Comments: 27 pages,7 figures
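
The loss described in the abstract above (mean absolute error on translation plus mean squared error on rotation) can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function name `reorientation_loss`, the `rot_weight` balancing factor, and the use of three translation and three rotation parameters are all assumptions, since the abstract does not state how the two terms are weighted or parameterized.

```python
from typing import Sequence


def reorientation_loss(
    pred_translation: Sequence[float],
    true_translation: Sequence[float],
    pred_rotation: Sequence[float],
    true_rotation: Sequence[float],
    rot_weight: float = 1.0,  # hypothetical weight; not given in the abstract
) -> float:
    """Return MAE(translation) + rot_weight * MSE(rotation)."""
    if len(pred_translation) != len(true_translation):
        raise ValueError("translation vectors must have the same length")
    if len(pred_rotation) != len(true_rotation):
        raise ValueError("rotation vectors must have the same length")

    # Mean absolute error over the translation parameters.
    mae_t = sum(
        abs(p - t) for p, t in zip(pred_translation, true_translation)
    ) / len(true_translation)

    # Mean squared error over the rotation parameters.
    mse_r = sum(
        (p - t) ** 2 for p, t in zip(pred_rotation, true_rotation)
    ) / len(true_rotation)

    return mae_t + rot_weight * mse_r


# Example with three translation and three rotation parameters (assumed).
loss = reorientation_loss([0.0, 0.0, 0.0], [3.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

Under this sketch, the three-stage strategy would amount to training with only the translation term, then only the rotation term, then both; how the parameters are frozen in each stage is not specified in the abstract.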

arXiv:2207.07195 [pdf]

doi 10.1016/j.trc.2022.103933

COOR-PLT: A hierarchical control model for coordinating adaptive platoons of connected and autonomous vehicles at signal-free intersections based on deep reinforcement learning

Authors: Duowei Li, Jian** Wu, Feng Zhu, Tianyi Chen, Yiik Diew Wong

Abstract: Platooning and coordination are two implementation strategies that are frequently proposed for traffic control of connected and autonomous vehicles (CAVs) at signal-free intersections instead of using conventional traffic signals. However, few studies have attempted to integrate both strategies to better facilitate the CAV control at signal-free intersections. To this end, this study proposes a hierarchical control model, named COOR-PLT, to coordinate adaptive CAV platoons at a signal-free intersection based on deep reinforcement learning (DRL). COOR-PLT has a two-layer framework. The first layer uses a centralized control strategy to form adaptive platoons. The optimal size of each platoon is determined by considering multiple objectives (i.e., efficiency, fairness and energy saving). The second layer employs a decentralized control strategy to coordinate multiple platoons passing through the intersection. Each platoon is labeled with coordinated status or independent status, upon which its passing priority is determined. As an efficient DRL algorithm, Deep Q-network (DQN) is adopted to determine platoon sizes and passing priorities respectively in the two layers. The model is validated and examined on the simulator Simulation of Urban Mobility (SUMO). The simulation results demonstrate that the model is able to: (1) achieve satisfactory convergence performances; (2) adaptively determine platoon size in response to varying traffic conditions; and (3) completely avoid deadlocks at the intersection.
By comparison with other control methods, the model manifests its superiority of adopting adaptive platooning and DRL-based coordination strategies. Also, the model outperforms several state-of-the-art methods on reducing travel time and fuel consumption in different traffic conditions.

Submitted 30 June, 2022; originally announced July 2022.

Comments: This paper has been submitted to Transportation Research Part C: Emerging Technologies and is currently under review

Journal ref: Transportation Research Part C: Emerging Technologies 146 (2023): 103933

arXiv:2206.03603 [pdf]

doi 10.1016/j.compbiomed.2023.106954

A new method incorporating deep learning with shape priors for left ventricular segmentation in myocardial perfusion SPECT images

Authors: Fubao Zhu, **yu Zhao, Chen Zhao, Shaojie Tang, Jiaofen Nan, Yanting Li, Zhongqiang Zhao, Jianzhou Shi, Zenghong Chen, Zhixin Jiang, Weihua Zhou

Abstract: Background: The assessment of left ventricular (LV) function by myocardial perfusion SPECT (MPS) relies on accurate myocardial segmentation. The purpose of this paper is to develop and validate a new method incorporating deep learning with shape priors to accurately extract the LV myocardium for automatic measurement of LV functional parameters. Methods: A segmentation architecture that integrates a three-dimensional (3D) V-Net with a shape deformation module was developed. Using the shape priors generated by a dynamic programming (DP) algorithm, the model output was then constrained and guided during the model training for quick convergence and improved performance. A stratified 5-fold cross-validation was used to train and validate our models. Results: Results of our proposed method agree well with those from the ground truth. Our proposed model achieved Dice similarity coefficients (DSC) of 0.9573(0.0244), 0.9821(0.0137), and 0.9903(0.0041), and Hausdorff distances (HD) of 6.7529(2.7334) mm, 7.2507(3.1952) mm, and 7.6121(3.0134) mm in extracting the endocardium, myocardium, and epicardium, respectively. Conclusion: Our proposed method achieved a high accuracy in extracting LV myocardial contours and assessing LV function.

Submitted 7 June, 2022; originally announced June 2022.

Comments: 21 pages, 14 figures

arXiv:2205.01805 [pdf, other]

doi 10.1109/MIPR.2019.00024

Splicing Detection and Localization In Satellite Imagery Using Conditional GANs

Authors: Emily R. Bartusiak, Sri Kalyan Yarlagadda, David Güera, Paolo Bestagini, Stefano Tubaro, Fengqing M. Zhu, Edward J. Delp

Abstract: The widespread availability of image editing tools and improvements in image processing techniques make image manipulation very easy. Oftentimes, easy-to-use yet sophisticated image manipulation tools yield distortions/changes imperceptible to the human observer. Distribution of forged images can have drastic ramifications, especially when coupled with the speed and vastness of the Internet. Therefore, verifying image integrity poses an immense and important challenge to the digital forensic community. Satellite images specifically can be modified in a number of ways, including the insertion of objects to hide existing scenes and structures. In this paper, we describe the use of a Conditional Generative Adversarial Network (cGAN) to identify the presence of such spliced forgeries within satellite images. Additionally, we identify their locations and shapes. Trained on pristine and falsified images, our method achieves high success on these detection and localization objectives.

Submitted 3 May, 2022; originally announced May 2022.

Comments: Accepted to the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)

Journal ref: IEEE Conference on Multimedia Information Processing and Retrieval, pp. 91-96, March 2019, San Jose, CA

arXiv:2202.13209 [pdf, other]

Opening the Black Box of Learned Image Coders

Authors: Zhihao Duan, Ming Lu, Zhan Ma, Fengqing Zhu

Abstract: End-to-end learned lossy image coders (LICs), as opposed to hand-crafted image codecs, have shown increasing superiority in terms of the rate-distortion performance. However, they are mainly treated as black-box systems and their interpretability is not well studied. In this paper, we show that LICs learn a set of basis functions to transform the input image into a compact representation in the latent space, analogous to the orthogonal transforms used in image coding standards. Our analysis provides insights to help understand how learned image coders work and could benefit future design and development.

Submitted 14 October, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

arXiv:2110.06439 [pdf, other]

Statistical CSI-Based Transmission Design for Reconfigurable Intelligent Surface-aided Massive MIMO Systems with Hardware Impairments

Authors: Jianxin Dai, Feng Zhu, Cunhua Pan, Hong Ren, Kezhi Wang

Abstract: We consider a reconfigurable intelligent surface (RIS)-aided massive multi-user multiple-input multiple-output (MIMO) communication system with transceiver hardware impairments (HWIs) and RIS phase noise. Different from the existing contributions, the phase shifts of the RIS are designed based on the long-term angle information. Firstly, an approximate analytical expression of the uplink achievable rate is derived. Then, we use a genetic algorithm (GA) to maximize the sum rate and the minimum data rate. Finally, we show that it is crucial to take HWIs into account when designing the phase shifts of the RIS.

Submitted 12 October, 2021; originally announced October 2021.

Comments: Accepted by IEEE Wireless Communications Letters. Keywords: Reconfigurable Intelligent Surface, Intelligent Reflecting Surface, Massive MIMO, Channel estimation, etc

arXiv:2109.02755 [pdf, other]

Motion Artifact Reduction In Photoplethysmography For Reliable Signal Selection

Authors: Runyu Mao, Mackenzie Tweardy, Stephan W. Wegerich, Craig J. Goergen, George R. Wodicka, Fengqing Zhu

Abstract: Photoplethysmography (PPG) is a non-invasive and economical technique to extract vital signs of the human body. Although it has been widely used in consumer and research grade wrist devices to track a user's physiology, the PPG signal is very sensitive to motion which can corrupt the signal's quality. Existing Motion Artifact (MA) reduction techniques have been developed and evaluated using either synthetic noisy signals or signals collected during high-intensity activities - both of which are difficult to generalize for real-life scenarios. Therefore, it is valuable to collect realistic PPG signals while performing Activities of Daily Living (ADL) to develop practical signal denoising and analysis methods. In this work, we propose an automatic pseudo clean PPG generation process for reliable PPG signal selection. For each noisy PPG segment, the corresponding pseudo clean PPG reduces the MAs and contains rich temporal details depicting cardiac features. Our experimental results show that 71% of the pseudo clean PPG collected from ADL can be considered high-quality segments, where the derived MAEs of heart rate and respiration rate are 1.46 BPM and 3.93 BrPM, respectively. Therefore, our proposed method can determine the reliability of the raw noisy PPG by considering the quality of the corresponding pseudo clean PPG signal.

Submitted 6 September, 2021; originally announced September 2021.

arXiv:2107.06185 [pdf]

A new method for vehicle system safety design based on data mining with uncertainty modeling

Authors: ** Du, Binhui Jiang, Feng Zhu

Abstract: In this research, a new data mining-based design approach has been developed for designing complex mechanical systems such as a crashworthy passenger car with uncertainty modeling. The method allows exploring the big crash simulation dataset to design the vehicle at multi-levels in a top-down manner (main energy absorbing system, components, and geometric features) and derive design rules based on the whole vehicle body safety requirements to make decisions towards the component and sub-component level design. Full vehicle and component simulation datasets are mined to build decision trees, where the interrelationship among parameters can be revealed and the design rules are derived to produce designs with good performance. This method has been extended by accounting for the uncertainty in the design variables. A new decision tree algorithm for uncertain data (DTUD) is developed to produce the desired designs and evaluate the design performance variations due to the uncertainty in design variables. The framework of this method is implemented by combining the design of experiments (DOE) and crash finite element analysis (FEA) and then demonstrated by designing a passenger car subject to front impact. The results show that the new methodology could achieve the design objectives efficiently and effectively. By applying the new method, the reliability of the final designs is also increased greatly. This approach has the potential to be applied as a general design methodology for a wide range of complex structures and mechanical systems.

Submitted 12 July, 2021; originally announced July 2021.

Comments: 38 pages, 21 figures, 6 tables

arXiv:2105.08819 [pdf, other]

Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report

Authors: Andrey Ignatov, Grigory Malivenko, Radu Timofte, Sheng Chen, Xin Xia, Zhaoyan Liu, Yuwei Zhang, Feng Zhu, Jiashi Li, Xuefeng Xiao, Yuan Tian, Xinglong Wu, Christos Kyrkou, Yixin Chen, Zexin Zhang, Yunbo Peng, Yue Lin, Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah, Himanshu Kumar, Chao Ge, Pei-Lin Wu, **-Hua Du, Andrew Batutin , et al. (6 additional authors not shown)

Abstract: Camera scene detection is among the most popular computer vision problems on smartphones. While many custom solutions were developed for this task by phone vendors, none of the designed models were available publicly up until now. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop quantized deep learning-based camera scene classification solutions that can demonstrate a real-time performance on smartphones and IoT platforms. For this, the participants were provided with a large-scale CamSDD dataset consisting of more than 11K images belonging to the 30 most important scene categories. The runtime of all models was evaluated on the popular Apple Bionic A11 platform that can be found in many iOS devices. The proposed solutions are fully compatible with all major mobile AI accelerators and can demonstrate more than 100-200 FPS on the majority of recent smartphone platforms while achieving a top-3 accuracy of more than 98%. A detailed description of all models developed in the challenge is provided in this paper.

Submitted 17 May, 2021; originally announced May 2021.

Comments: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/. arXiv admin note: substantial text overlap with arXiv:2105.08630; text overlap with arXiv:2105.07825, arXiv:2105.07809, arXiv:2105.08629

arXiv:2102.05024 [pdf, other]

Turkey Behavior Identification System with a GUI Using Deep Learning and Video Analytics

Authors: Shengtai Ju, Sneha Mahapatra, Marisa A. Erasmus, Amy R. Reibman, Fengqing Zhu

Abstract: In this paper, we propose a video analytics system to identify the behavior of turkeys. Turkey behavior provides evidence to assess turkey welfare, which can be negatively impacted by uncomfortable ambient temperature and various diseases. In particular, healthy and sick turkeys behave differently in terms of the duration and frequency of activities such as eating, drinking, preening, and aggressive interactions. Our system incorporates recent advances in object detection and tracking to automate the process of identifying and analyzing turkey behavior captured by commercial grade cameras. We combine deep-learning and traditional image processing methods to address challenges in this practical agricultural problem. Our system also includes a web-based user interface to create visualization of automated analysis results. Together, we provide an improved tool for turkey researchers to assess turkey welfare without the time-consuming and labor-intensive manual inspection.

Submitted 9 February, 2021; originally announced February 2021.

arXiv:2101.06341 [pdf, other]

Advances In Video Compression System Using Deep Neural Network: A Review And Case Studies

Authors: Dandan Ding, Zhan Ma, Di Chen, Qingshuang Chen, Zoe Liu, Fengqing Zhu

Abstract: Significant advances in video compression systems have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic. From the application perspective, we have identified three major functional blocks including pre-processing, coding, and post-processing, that have been continuously investigated to maximize the end-user quality of experience (QoE) under a limited bit rate budget. Recently, artificial intelligence (AI) powered techniques have shown great potential to further increase the efficiency of the aforementioned functional blocks, both individually and jointly. In this article, we extensively review recent technical advances in video compression systems, with an emphasis on deep neural network (DNN)-based approaches; and then present three comprehensive case studies. On pre-processing, we show a switchable texture-based video coding example that leverages DNN-based scene understanding to extract semantic areas for the improvement of the subsequent video coder. On coding, we present an end-to-end neural video coding framework that takes advantage of the stacked DNNs to efficiently and compactly code input raw videos via fully data-driven learning. On post-processing, we demonstrate two neural adaptive filters to respectively facilitate the in-loop and post filtering for the enhancement of compressed frames. Finally, a companion website hosting the contents developed in this work can be accessed publicly at https://purdueviper.github.io/dnn-coding/.

Submitted 15 January, 2021; originally announced January 2021.

arXiv:2012.00650 [pdf, other]

Decomposition, Compression, and Synthesis (DCS)-based Video Coding: A Neural Exploration via Resolution-Adaptive Learning

Authors: Ming Lu, Tong Chen, Dandan Ding, Fengqing Zhu, Zhan Ma

Abstract: Inspired by the facts that retinal cells actually segregate the visual scene into different attributes (e.g., spatial details, temporal motion) for respective neuronal processing, we propose to first decompose the input video into respective spatial texture frames (STF) at its native spatial resolution that preserve the rich spatial details, and the other temporal motion frames (TMF) at a lower spatial resolution that retain the motion smoothness; then compress them together using any popular video coder; and finally synthesize decoded STFs and TMFs for high-fidelity video reconstruction at the same resolution as its native input. This work simply applies the bicubic resampling in decomposition and HEVC compliant codec in compression, and puts the focus on the synthesis part. For resolution-adaptive synthesis, a motion compensation network (MCN) is devised on TMFs to efficiently align and aggregate temporal motion features that will be jointly processed with corresponding STFs using a non-local texture transfer network (NL-TTN) to better augment spatial details, by which the compression and resolution resampling noises can be effectively alleviated with better rate-distortion efficiency. Such "Decomposition, Compression, Synthesis (DCS)" based scheme is codec agnostic, currently exemplifying averaged $\approx$1 dB PSNR gain or $\approx$25% BD-rate saving, against the HEVC anchor using reference software. In addition, experimental comparisons to the state-of-the-art methods and ablation studies are conducted to further report the efficiency and generalization of DCS algorithm, promising an encouraging direction for future video coding.

Submitted 15 January, 2024; v1 submitted 1 December, 2020; originally announced December 2020.

arXiv:2008.05765 [pdf, other]

Revisiting Temporal Modeling for Video Super-resolution

Authors: Takashi Isobe, Fang Zhu, Xu Jia, Sheng** Wang

Abstract: Video super-resolution plays an important role in surveillance video analysis and ultra-high-definition video display, which has drawn much attention in both the research and industrial communities. Although many deep learning-based VSR methods have been proposed, it is hard to directly compare these methods since the different loss functions and training datasets have a significant impact on the super-resolution results. In this work, we carefully study and compare three temporal modeling methods (2D CNN with early fusion, 3D CNN with slow fusion and Recurrent Neural Network) for video super-resolution. We also propose a novel Recurrent Residual Network (RRN) for efficient video super-resolution, where residual learning is utilized to stabilize the training of RNN and meanwhile to boost the super-resolution performance. Extensive experiments show that the proposed RRN is highly computationally efficient and produces temporally consistent VSR results with finer details than other temporal modeling methods. Besides, the proposed method achieves state-of-the-art results on several widely used benchmarks.

Submitted 19 August, 2020; v1 submitted 13 August, 2020; originally announced August 2020.

Comments: BMVC 2020

arXiv:2007.02091 [pdf]

doi 10.1016/j.ijleo.2021.167551

Semantic Segmentation Using Deep Learning to Extract Total Extraocular Muscles and Optic Nerve from Orbital Computed Tomography Images

Authors: Fubao Zhu, Zhengyuan Gao, Chen Zhao, Zelin Zhu, Yanyun Liu, Shaojie Tang, Chengzhi Jiang, Xinhui Li, Min Zhao, Weihua Zhou

Abstract: Objectives: Precise segmentation of total extraocular muscles (EOM) and optic nerve (ON) is essential to assess anatomical development and progression of thyroid-associated ophthalmopathy (TAO). We aim to develop a semantic segmentation method based on deep learning to extract the total EOM and ON from orbital CT images in patients with suspected TAO. Materials and Methods: A total of 7,879 images obtained from 97 subjects who underwent orbit CT scans due to suspected TAO were enrolled in this study. Eighty-eight patients were randomly selected into the training/validation dataset, and the rest were put into the test dataset. Contours of the total EOM and ON in all the patients were manually delineated by experienced radiologists as the ground truth. A three-dimensional (3D) end-to-end fully convolutional neural network called semantic V-net (SV-net) was developed for our segmentation task. Intersection over Union (IoU) was measured to evaluate the accuracy of the segmentation results, and Pearson correlation analysis was used to evaluate the volumes measured from our segmentation results against those from the ground truth. Results: Our model in the test dataset achieved an overall IoU of 0.8207; the IoU was 0.7599 for the superior rectus muscle, 0.8183 for the lateral rectus muscle, 0.8481 for the medial rectus muscle, 0.8436 for the inferior rectus muscle and 0.8337 for the optic nerve. The volumes measured from our segmentation results agreed well with those from the ground truth (all R>0.98, P<0.0001). Conclusion: The qualitative and quantitative evaluations demonstrate excellent performance of our method in automatically extracting the total EOM and ON and measuring their volumes in orbital CT images. There is a great promise for clinical application to assess these anatomical structures for the diagnosis and prognosis of TAO.

Submitted 4 July, 2020; originally announced July 2020.

Comments: 17 pages, 8 figures

arXiv:2006.11538 [pdf, other]

Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition

Authors: Ionut Cosmin Duta, Li Liu, Fan Zhu, Ling Shao

Abstract: This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of details in the scene. On top of these improved recognition capabilities, PyConv is also efficient and, with our formulation, it does not increase the computational cost and parameters compared to standard convolution. Moreover, it is very flexible and extensible, providing a large space of potential network architectures for different applications. PyConv has the potential to impact nearly every computer vision task and, in this work, we present different architectures based on PyConv for four main tasks on visual recognition: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing. Our approach shows significant improvements over all these core tasks in comparison with the baselines. For instance, on image recognition, our 50-layer network outperforms its 152-layer ResNet baseline counterpart in recognition performance on the ImageNet dataset, while having 2.39 times fewer parameters, 2.52 times lower computational complexity and more than 3 times fewer layers. On image segmentation, our novel framework sets a new state-of-the-art on the challenging ADE20K benchmark for scene parsing. Code is available at: https://github.com/iduta/pyconv

Submitted 20 June, 2020; originally announced June 2020.

arXiv:2004.12027 [pdf, other]

Deepfakes Detection with Automatic Face Weighting

Authors: Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao, János Horváth, Emily Bartusiak, Justin Yang, David Güera, Fengqing Zhu, Edward J. Delp

Abstract: Altered and manipulated multimedia is increasingly present and widely distributed via social media platforms. Advanced video manipulation tools enable the generation of highly realistic-looking altered multimedia. While many methods have been presented to detect manipulations, most of them fail when evaluated with data outside of the datasets used in research environments. In order to address this problem, the Deepfake Detection Challenge (DFDC) provides a large dataset of videos containing realistic manipulations and an evaluation system that ensures that methods work quickly and accurately, even when faced with challenging data. In this paper, we introduce a method based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) that extracts visual and temporal features from faces present in videos to accurately detect manipulations. The method is evaluated with the DFDC dataset, providing competitive results compared to other techniques.

Submitted 4 May, 2020; v1 submitted 24 April, 2020; originally announced April 2020.

arXiv:2004.00583 [pdf, other]

Improving Deep Hyperspectral Image Classification Performance with Spectral Unmixing

Authors: Alan J. X. Guo, Fei Zhu

Abstract: Recent advances in neural networks have made great progress in hyperspectral image (HSI) classification. However, the overfitting effect, which is mainly caused by complicated model structure and small training set, remains a major concern. Reducing the complexity of the neural networks could prevent overfitting to some extent, but also reduces the networks' ability to express more abstract features. Enlarging the training set is also difficult, owing to the high expense of acquisition and manual labeling. In this paper, we propose an abundance-based multi-HSI classification method. Firstly, we convert every HSI from the spectral domain to the abundance domain by a dataset-specific autoencoder. Secondly, the abundance representations from multiple HSIs are collected to form an enlarged dataset. Lastly, we train an abundance-based classifier and employ the classifier to predict over all the involved HSI datasets. Different from the spectra that are usually highly mixed, the abundance features are more representative in reduced dimension with less noise. This allows the proposed method to employ simple classifiers and enlarged training data, with fewer overfitting issues expected. The effectiveness of the proposed method is verified by the ablation study and the comparative experiments.

Submitted 21 December, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

arXiv:1908.02875 [pdf, ps, other]

Convolutional Neural Networks Based Texture Modeling For AV1

Authors: Di Chen, Chichen Fu, Zoe Liu, Fengqing Zhu

Abstract: Modern video codecs including the newly developed AOMedia Video 1 (AV1) utilize hybrid coding techniques to remove spatial and temporal redundancy. However, efficient exploitation of statistical dependencies measured by a mean squared error (MSE) does not always produce the best psychovisual result. One interesting approach is to only encode visually relevant information and use a different coding method for "perceptually insignificant" regions in the frame, which can lead to substantial data rate reductions while maintaining visual quality. In this paper, we introduce a texture analyzer before encoding the input sequences to identify "perceptually insignificant" regions in the frame using convolutional neural networks. We designed and developed a new scheme that integrates the texture analyzer into the codec and can largely reduce the temporal flickering artifact for codecs with a hierarchical coding structure. The proposed method is implemented in the AV1 codec by introducing a new coding tool called texture mode, a special inter mode treated at the encoder: if texture mode is selected, no inter prediction is performed for the identified texture regions. Instead, displacement of the entire region is modeled by just one set of motion parameters. Therefore, only the model parameters are transmitted to the decoder for reconstructing the texture regions. Non-texture regions in the frame are coded conventionally. We show that for many standard test sets, the proposed method achieved significant data rate reductions with satisfying visual quality.

Submitted 7 August, 2019; originally announced August 2019.

Comments: 22 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1804.09291

arXiv:1906.06878 [pdf, other]

doi 10.1109/TIP.2020.3026622

Noisy-As-Clean: Learning Self-supervised Denoising from the Corrupted Image

Authors: Jun Xu, Yuan Huang, Ming-Ming Cheng, Li Liu, Fan Zhu, Zhou Xu, Ling Shao

Abstract: Supervised deep networks have achieved promising performance on image denoising, by learning image priors and noise statistics on plentiful pairs of noisy and clean images. Unsupervised denoising networks are trained with only noisy images. However, for an unseen corrupted image, both supervised and unsupervised networks ignore either its particular image prior, the noise statistics, or both. That is, the networks learned from external images inherently suffer from a domain gap problem: the image priors and noise statistics are very different between the training and test images. This problem becomes more clear when dealing with the signal dependent realistic noise. To circumvent this problem, in this work, we propose a novel "Noisy-As-Clean" (NAC) strategy of training self-supervised denoising networks. Specifically, the corrupted test image is directly taken as the "clean" target, while the inputs are synthetic images consisting of this corrupted image and a second and similar corruption. A simple but useful observation on our NAC is: as long as the noise is weak, it is feasible to learn a self-supervised network only with the corrupted image, approximating the optimal parameters of a supervised network learned with pairs of noisy and clean images. Experiments on synthetic and realistic noise removal demonstrate that the DnCNN and ResNet networks trained with our self-supervised NAC strategy achieve comparable or better performance than the original ones and previous supervised/unsupervised/self-supervised networks. The code is publicly available at https://github.com/csjunxu/Noisy-As-Clean.

Submitted 9 May, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Comments: 12 pages, 9 figures, 6 tables, the first two authors contribute equally

arXiv:1812.04943 [pdf, other]

Long-range depth imaging using a single-photon detector array and non-local data fusion

Authors: Susan Chan, Abderrahim Halimi, Feng Zhu, Istvan Gyongy, Robert K. Henderson, Richard Bowman, Steve McLaughlin, Gerald S. Buller, Jonathan Leach

Abstract: The ability to measure and record high-resolution depth images at long stand-off distances is important for a wide range of applications, including connected and automotive vehicles, defense and security, and agriculture and mining. In LIDAR (light detection and ranging) applications, single-photon sensitive detection is an emerging approach, offering high sensitivity to light and picosecond temporal resolution, and consequently excellent surface-to-surface resolution. The use of large format CMOS single-photon detector arrays provides high spatial resolution and allows the timing information to be acquired simultaneously across many pixels. In this work, we combine state-of-the-art single-photon detector array technology with non-local data fusion to generate high resolution three-dimensional depth information of long-range targets. The system is based on a visible pulsed illumination system at 670 nm and a 240 × 320 pixel array sensor, achieving sub-centimeter precision in all three spatial dimensions at a distance of 150 meters. The non-local data fusion combines information from an optical image with sparse sampling of the single-photon array data, providing accurate depth information at low signature regions of the target.

Submitted 11 December, 2018; originally announced December 2018.

arXiv:1807.08048 [pdf, other]

Baidu Apollo EM Motion Planner

Authors: Haoyang Fan, Fan Zhu, Changchun Liu, Liangliang Zhang, Li Zhuang, Dong Li, Weicheng Zhu, Jiangtao Hu, Hongye Li, Qi Kong

Abstract: In this manuscript, we introduce a real-time motion planning system based on the Baidu Apollo (open source) autonomous driving platform. The developed system aims to address the industrial level-4 motion planning problem while considering safety, comfort and scalability. The system covers multilane and single-lane autonomous driving in a hierarchical manner: (1) The top layer of the system is a multilane strategy that handles lane-change scenarios by comparing lane-level trajectories computed in parallel. (2) Inside the lane-level trajectory generator, it iteratively solves path and speed optimization based on a Frenet frame. (3) For path and speed optimization, a combination of dynamic programming and spline-based quadratic programming is proposed to construct a scalable and easy-to-tune framework to handle traffic rules, obstacle decisions and smoothness simultaneously. The planner is scalable to both highway and lower-speed city driving scenarios. We also demonstrate the algorithm through scenario illustrations and on-road test results. The system described in this manuscript has been deployed to dozens of Baidu Apollo autonomous driving vehicles since Apollo v1.5 was announced in September 2017. As of May 16th, 2018, the system has been tested for 3,380 hours and approximately 68,000 kilometers (42,253 miles) of closed-loop autonomous driving under various urban scenarios. The algorithm described in this manuscript is available at https://github.com/ApolloAuto/apollo/tree/master/modules/planning.

Submitted 20 July, 2018; originally announced July 2018.

Showing 1–50 of 53 results for author: Zhu, F