Search | arXiv e-print repository

Overfitted image coding at reduced complexity

Authors: Théophile Blard, Théo Ladune, Pierrick Philippe, Gordon Clare, Xiaoran Jiang, Olivier Déforges

Abstract: Overfitted image codecs offer compelling compression performance and low decoder complexity, through the overfitting of a lightweight decoder for each image. Such codecs include Cool-chic, which presents image coding performance on par with VVC while requiring around 2000 multiplications per decoded pixel. This paper proposes to decrease Cool-chic encoding and decoding complexity. The encoding com… ▽ More Overfitted image codecs offer compelling compression performance and low decoder complexity, through the overfitting of a lightweight decoder for each image. Such codecs include Cool-chic, which presents image coding performance on par with VVC while requiring around 2000 multiplications per decoded pixel. This paper proposes to decrease Cool-chic encoding and decoding complexity. The encoding complexity is reduced by shortening Cool-chic training, up to the point where no overfitting is performed at all. It is also shown that a tiny neural decoder with 300 multiplications per pixel still outperforms HEVC. A near real-time CPU implementation of this decoder is made available at https://orange-opensource.github.io/Cool-Chic/. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: 5 pages, submitted to European Signal Processing Conference (EUSIPCO) 2024

arXiv:2402.03179 [pdf, other]

Cool-chic video: Learned video coding with 800 parameters

Authors: Thomas Leguay, Théo Ladune, Pierrick Philippe, Olivier Déforges

Abstract: We propose a lightweight learned video codec with 900 multiplications per decoded pixel and 800 parameters overall. To the best of our knowledge, this is one of the neural video codecs with the lowest decoding complexity. It is built upon the overfitted image codec Cool-chic and supplements it with an inter coding module to leverage the video's temporal redundancies. The proposed model is able to… ▽ More We propose a lightweight learned video codec with 900 multiplications per decoded pixel and 800 parameters overall. To the best of our knowledge, this is one of the neural video codecs with the lowest decoding complexity. It is built upon the overfitted image codec Cool-chic and supplements it with an inter coding module to leverage the video's temporal redundancies. The proposed model is able to compress videos using both low-delay and random access configurations and achieves rate-distortion close to AVC while out-performing other overfitted codecs such as FFNeRV. The system is made open-source: orange-opensource.github.io/Cool-Chic. △ Less

Submitted 6 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 10 pages, published in Data Compression Conference 2024

arXiv:2310.05623 [pdf, other]

Efficient Predictive Coding of Intra Prediction Modes

Authors: Kevin Reuzé, Wassim Hamidouche, Pierrick Philippe, Olivier Déforges

Abstract: The high efficiency video coding (HEVC) standard and the joint exploration model (JEM) codec incorporate 35 and 67 intra prediction modes (IPMs) respectively, which are essential for efficient compression of Intra coded blocks. These IPMs are transmitted to the decoder through a coding scheme. In our paper, we present an innovative approach to construct a dedicated coding scheme for IPM based on c… ▽ More The high efficiency video coding (HEVC) standard and the joint exploration model (JEM) codec incorporate 35 and 67 intra prediction modes (IPMs) respectively, which are essential for efficient compression of Intra coded blocks. These IPMs are transmitted to the decoder through a coding scheme. In our paper, we present an innovative approach to construct a dedicated coding scheme for IPM based on contextual information. This approach comprises three key steps: prediction, clustering, and coding, each of which has been enhanced by introducing new elements, namely, labels for prediction, tests for clustering, and codes for coding. In this context, we have proposed a method that utilizes a genetic algorithm to minimize the rate cost, aiming to derive the most efficient coding scheme while leveraging the available labels, tests, and codes. The resulting coding scheme, expressed as a binary tree, achieves the highest coding efficiency for a given level of complexity. In our experimental evaluation under the HEVC standard, we observed significant bitrate gains while maintaining coding efficiency under the JEM codec. These results demonstrate the potential of our approach to improve compression efficiency, particularly under the HEVC standard, while preserving the coding efficiency of the JEM codec. △ Less

Submitted 9 October, 2023; originally announced October 2023.

arXiv:2307.09729 [pdf, other]

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu , et al. (47 additional authors not shown)

Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual… ▽ More This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2206.02131 [pdf, other]

Federated Adversarial Training with Transformers

Authors: Ahmed Aldahdooh, Wassim Hamidouche, Olivier Déforges

Abstract: Federated learning (FL) has emerged to enable global model training over distributed clients' data while preserving its privacy. However, the global trained model is vulnerable to the evasion attacks especially, the adversarial examples (AEs), carefully crafted samples to yield false classification. Adversarial training (AT) is found to be the most promising approach against evasion attacks and it… ▽ More Federated learning (FL) has emerged to enable global model training over distributed clients' data while preserving its privacy. However, the global trained model is vulnerable to the evasion attacks especially, the adversarial examples (AEs), carefully crafted samples to yield false classification. Adversarial training (AT) is found to be the most promising approach against evasion attacks and it is widely studied for convolutional neural network (CNN). Recently, vision transformers have been found to be effective in many computer vision tasks. To the best of the authors' knowledge, there is no work that studied the feasibility of AT in a FL process for vision transformers. This paper investigates such feasibility with different federated model aggregation methods and different vision transformer models with different tokenization and classification head techniques. In order to improve the robust accuracy of the models with the not independent and identically distributed (Non-IID), we propose an extension to FedAvg aggregation method, called FedWAvg. By measuring the similarities between the last layer of the global model and the last layer of the client updates, FedWAvg calculates the weights to aggregate the local models updates. The experiments show that FedWAvg improves the robust accuracy when compared with other state-of-the-art aggregation methods. △ Less

Submitted 5 June, 2022; originally announced June 2022.

arXiv:2202.00416 [pdf, other]

doi 10.1109/VCIP53242.2021.9675351

CAESR: Conditional Autoencoder and Super-Resolution for Learned Spatial Scalability

Authors: Charles Bonnineau, Wassim Hamidouche, Jean-François Travers, Naty Sidaty, Jean-Yves Aubié, Olivier Deforges

Abstract: In this paper, we present CAESR, an hybrid learning-based coding approach for spatial scalability based on the versatile video coding (VVC) standard. Our framework considers a low-resolution signal encoded with VVC intra-mode as a base-layer (BL), and a deep conditional autoencoder with hyperprior (AE-HP) as an enhancement-layer (EL) model. The EL encoder takes as inputs both the upscaled BL recon… ▽ More In this paper, we present CAESR, an hybrid learning-based coding approach for spatial scalability based on the versatile video coding (VVC) standard. Our framework considers a low-resolution signal encoded with VVC intra-mode as a base-layer (BL), and a deep conditional autoencoder with hyperprior (AE-HP) as an enhancement-layer (EL) model. The EL encoder takes as inputs both the upscaled BL reconstruction and the original image. Our approach relies on conditional coding that learns the optimal mixture of the source and the upscaled BL image, enabling better performance than residual coding. On the decoder side, a super-resolution (SR) module is used to recover high-resolution details and invert the conditional coding process. Experimental results have shown that our solution is competitive with the VVC full-resolution intra coding while being scalable. △ Less

Submitted 1 February, 2022; originally announced February 2022.

Journal ref: 2021 International Conference on Visual Communications and Image Processing (VCIP)

arXiv:2109.06555 [pdf, other]

Perceptual Quality Assessment of HEVC and VVC Standards for 8K Video

Authors: Charles Bonnineau, Wassim Hamidouche, Jerome Fournier, Naty Sidaty, Jean-Francois Travers, Olivier Deforges

Abstract: With the growing data consumption of emerging video applications and users requirement for higher resolutions, up to 8K, a huge effort has been made in video compression technologies. Recently, versatile video coding (VVC) has been standardized by the moving picture expert group (MPEG), providing a significant improvement in compression performance over its predecessor high efficiency video coding… ▽ More With the growing data consumption of emerging video applications and users requirement for higher resolutions, up to 8K, a huge effort has been made in video compression technologies. Recently, versatile video coding (VVC) has been standardized by the moving picture expert group (MPEG), providing a significant improvement in compression performance over its predecessor high efficiency video coding (HEVC). In this paper, we provide a comparative subjective quality evaluation between VVC and HEVC standards for 8K resolution videos. In addition, we evaluate the perceived quality improvement offered by 8K over UHD 4K resolution. The compression performance of both VVC and HEVC standards has been conducted in random access (RA) coding configuration, using their respective reference software, VVC test model (VTM-11) and HEVC test model (HM-16.20). Objective measurements, using PSNR, MS-SSIM and VMAF metrics have shown that the bitrate gains offered by VVC over HEVC for 8K video content are around 31%, 26% and 35%, respectively. Subjectively, VVC offers an average of 40% of bitrate reduction over HEVC for the same visual quality. A compression gain of 50% has been reached for some tested video sequences regarding a Student t-test analysis. In addition, for most tested scenes, a significant visual difference between uncompressed 4K and 8K has been noticed. △ Less

Submitted 20 December, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: Paper under review

arXiv:2107.11659 [pdf, other]

Lightweight Hardware Transform Design for the Versatile Video Coding 4K ASIC Decoders

Authors: Ibrahim Farhat, Wassim Hamidouche, Adrien Grill, Daniel Ménard, Olivier Déforges

Abstract: Versatile Video Coding (VVC) is the next generation video coding standard finalized in July 2020. VVC introduces new coding tools enhancing the coding efficiency compared to its predecessor High Efficiency Video Coding (HEVC). These new tools have a significant impact on the VVC software decoder complexity estimated to 2 times HEVC decoder complexity. In particular, the transform module includes i… ▽ More Versatile Video Coding (VVC) is the next generation video coding standard finalized in July 2020. VVC introduces new coding tools enhancing the coding efficiency compared to its predecessor High Efficiency Video Coding (HEVC). These new tools have a significant impact on the VVC software decoder complexity estimated to 2 times HEVC decoder complexity. In particular, the transform module includes in VVC separable and non-separable transforms named Multiple Transform Selection (MTS) and Low Frequency Non-Separable Transform (LFNST) tools, respectively. In this paper, we present an area-efficient hardware architecture of the inverse transform module for a VVC decoder. The proposed design uses a total of 64 regular multipliers in a pipelined architecture targeting Application-Specific Integrated Circuit (ASIC) platforms. It consists in a multi-standard architecture that supports the transform modules of recent MPEG standards including Advanced Video Coding (AVC), HEVC and VVC. The architecture leverages all primary and secondary transforms optimisations including butterfly de-composition, coefficients zeroing and the inherent linear relationship between the transforms. The synthesized results show that the proposed method sustains a constant throughput of 1sample per cycle and a constant latency for all block sizes. The proposed hardware inverse transform module operates at 600MHz frequency enabling to decode in real-time 4K video at 30 frames per second in 4:2:2 chroma sub-sampling format. The proposed module has been integrated in an ASIC UHD decoder targeting energy-aware decoding of VVC videos on consumer devices. △ Less

Submitted 6 November, 2021; v1 submitted 24 July, 2021; originally announced July 2021.

arXiv:2106.03734 [pdf, other]

Reveal of Vision Transformers Robustness against Adversarial Attacks

Authors: Ahmed Aldahdooh, Wassim Hamidouche, Olivier Deforges

Abstract: The major part of the vanilla vision transformer (ViT) is the attention block that brings the power of mimicking the global context of the input image. For better performance, ViT needs large-scale training data. To overcome this data hunger limitation, many ViT-based networks, or hybrid-ViT, have been proposed to include local context during the training. The robustness of ViTs and its variants a… ▽ More The major part of the vanilla vision transformer (ViT) is the attention block that brings the power of mimicking the global context of the input image. For better performance, ViT needs large-scale training data. To overcome this data hunger limitation, many ViT-based networks, or hybrid-ViT, have been proposed to include local context during the training. The robustness of ViTs and its variants against adversarial attacks has not been widely investigated in the literature like CNNs. This work studies the robustness of ViT variants 1) against different Lp-based adversarial attacks in comparison with CNNs, 2) under adversarial examples (AEs) after applying preprocessing defense methods and 3) under the adaptive attacks using expectation over transformation (EOT) framework. To that end, we run a set of experiments on 1000 images from ImageNet-1k and then provide an analysis that reveals that vanilla ViT or hybrid-ViT are more robust than CNNs. For instance, we found that 1) Vanilla ViTs or hybrid-ViTs are more robust than CNNs under Lp-based attacks and under adaptive attacks. 2) Unlike hybrid-ViTs, Vanilla ViTs are not responding to preprocessing defenses that mainly reduce the high frequency components. Furthermore, feature maps, attention maps, and Grad-CAM visualization jointly with image quality measures, and perturbations' energy spectrum are provided for an insight understanding of attention-based models. △ Less

Submitted 20 September, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2105.11578 [pdf, other]

SHD360: A Benchmark Dataset for Salient Human Detection in 360° Videos

Authors: Yi Zhang, Lu Zhang, Kang Wang, Wassim Hamidouche, Olivier Deforges

Abstract: Salient human detection (SHD) in dynamic 360° immersive videos is of great importance for various applications such as robotics, inter-human and human-object interaction in augmented reality. However, 360° video SHD has been seldom discussed in the computer vision community due to a lack of datasets with large-scale omnidirectional videos and rich annotations. To this end, we propose SHD360, the f… ▽ More Salient human detection (SHD) in dynamic 360° immersive videos is of great importance for various applications such as robotics, inter-human and human-object interaction in augmented reality. However, 360° video SHD has been seldom discussed in the computer vision community due to a lack of datasets with large-scale omnidirectional videos and rich annotations. To this end, we propose SHD360, the first 360° video SHD dataset which contains various real-life daily scenes. Since so far there is no method proposed for 360° image/video SHD, we systematically benchmark 11 representative state-of-the-art salient object detection (SOD) approaches on our SHD360, and explore key issues derived from extensive experimenting results. We hope our proposed dataset and benchmark could serve as a good starting point for advancing human-centric researches towards 360° panoramic data. The dataset is available at https://github.com/PanoAsh/SHD360. △ Less

Submitted 22 December, 2021; v1 submitted 24 May, 2021; originally announced May 2021.

Comments: 21 pages, 13 figures, 5 tables; Project page: https://github.com/PanoAsh/SHD360; Technical report

arXiv:2105.00949 [pdf, other]

CMA-Net: A Cascaded Mutual Attention Network for Light Field Salient Object Detection

Authors: Yi Zhang, Lu Zhang, Wassim Hamidouche, Olivier Deforges

Abstract: In the past few years, numerous deep learning methods have been proposed to address the task of segmenting salient objects from RGB images. However, these approaches depending on single modality fail to achieve the state-of-the-art performance on widely used light field salient object detection (SOD) datasets, which collect large-scale natural images and provide multiple modalities such as multi-v… ▽ More In the past few years, numerous deep learning methods have been proposed to address the task of segmenting salient objects from RGB images. However, these approaches depending on single modality fail to achieve the state-of-the-art performance on widely used light field salient object detection (SOD) datasets, which collect large-scale natural images and provide multiple modalities such as multi-view, micro-lens images and depth maps. Most recently proposed light field SOD methods have acquired improving detecting accuracy, yet still predict rough objects' structures and perform slow inference speed. To this end, we propose CMA-Net, which consists of two novel cascaded mutual attention modules aiming at fusing the high level features from the modalities of all-in-focus and depth. Our proposed CMA-Net outperforms 30 SOD methods on two widely applied light field benchmark datasets. Besides, the proposed CMA-Net is able to inference at the speed of 53 fps. Extensive quantitative and qualitative experiments illustrate both the effectiveness and efficiency of our CMA-Net. △ Less

Submitted 7 December, 2021; v1 submitted 3 May, 2021; originally announced May 2021.

Comments: 6 pages, 4 figures, 2 tables

arXiv:2105.00203 [pdf, other]

doi 10.1007/s10462-021-10125-w

Adversarial Example Detection for DNN Models: A Review and Experimental Comparison

Authors: Ahmed Aldahdooh, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Deforges

Abstract: Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications have to draw their path to success deployment once they have the capability to overcome safety-critical challenges. Among these challenges are th… ▽ More Deep learning (DL) has shown great success in many human-related tasks, which has led to its adoption in many computer vision based applications, such as security surveillance systems, autonomous vehicles and healthcare. Such safety-critical applications have to draw their path to success deployment once they have the capability to overcome safety-critical challenges. Among these challenges are the defense against or/and the detection of the adversarial examples (AEs). Adversaries can carefully craft small, often imperceptible, noise called perturbations to be added to the clean image to generate the AE. The aim of AE is to fool the DL model which makes it a potential risk for DL applications. Many test-time evasion attacks and countermeasures,i.e., defense or detection methods, are proposed in the literature. Moreover, few reviews and surveys were published and theoretically showed the taxonomy of the threats and the countermeasure methods with little focus in AE detection methods. In this paper, we focus on image classification task and attempt to provide a survey for detection methods of test-time evasion attacks on neural network classifiers. A detailed discussion for such methods is provided with experimental results for eight state-of-the-art detectors under different scenarios on four datasets. We also provide potential challenges and future perspectives for this research direction. △ Less

Submitted 7 January, 2022; v1 submitted 1 May, 2021; originally announced May 2021.

Comments: Accepted and published in Artificial Intelligence Review journal

arXiv:2104.13916 [pdf, other]

Learning Synergistic Attention for Light Field Salient Object Detection

Authors: Yi Zhang, Geng Chen, Qian Chen, Yujia Sun, Yong Xia, Olivier Deforges, Wassim Hamidouche, Lu Zhang

Abstract: We propose a novel Synergistic Attention Network (SA-Net) to address the light field salient object detection by establishing a synergistic effect between multi-modal features with advanced attention mechanisms. Our SA-Net exploits the rich information of focal stacks via 3D convolutional neural networks, decodes the high-level features of multi-modal light field data with two cascaded synergistic… ▽ More We propose a novel Synergistic Attention Network (SA-Net) to address the light field salient object detection by establishing a synergistic effect between multi-modal features with advanced attention mechanisms. Our SA-Net exploits the rich information of focal stacks via 3D convolutional neural networks, decodes the high-level features of multi-modal light field data with two cascaded synergistic attention modules, and predicts the saliency map using an effective feature fusion module in a progressive manner. Extensive experiments on three widely-used benchmark datasets show that our SA-Net outperforms 28 state-of-the-art models, sufficiently demonstrating its effectiveness and superiority. Our code is available at https://github.com/PanoAsh/SA-Net. △ Less

Submitted 22 October, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: 20 pages, 12 figures; Project Page https://github.com/PanoAsh/SA-Net ; Accepted to BMVC-21

arXiv:2104.09103 [pdf, other]

Conditional Coding and Variable Bitrate for Practical Learned Video Coding

Authors: Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Déforges

Abstract: This paper introduces a practical learned video codec. Conditional coding and quantization gain vectors are used to provide flexibility to a single encoder/decoder pair, which is able to compress video sequences at a variable bitrate. The flexibility is leveraged at test time by choosing the rate and GOP structure to optimize a rate-distortion cost. Using the CLIC21 video test conditions, the prop… ▽ More This paper introduces a practical learned video codec. Conditional coding and quantization gain vectors are used to provide flexibility to a single encoder/decoder pair, which is able to compress video sequences at a variable bitrate. The flexibility is leveraged at test time by choosing the rate and GOP structure to optimize a rate-distortion cost. Using the CLIC21 video test conditions, the proposed approach shows performance on par with HEVC. △ Less

Submitted 20 April, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

Journal ref: CLIC workshop, CVPR 2021, Jun 2021, Nashville, United States

arXiv:2104.08319 [pdf, other]

Multitask Learning for VVC Quality Enhancement and Super-Resolution

Authors: Charles Bonnineau, Wassim Hamidouche, Jean-Francois Travers, Naty Sidaty, Olivier Deforges

Abstract: The latest video coding standard, called versatile video coding (VVC), includes several novel and refined coding tools at different levels of the coding chain. These tools bring significant coding gains with respect to the previous standard, high efficiency video coding (HEVC). However, the encoder may still introduce visible coding artifacts, mainly caused by coding decisions applied to adjust th… ▽ More The latest video coding standard, called versatile video coding (VVC), includes several novel and refined coding tools at different levels of the coding chain. These tools bring significant coding gains with respect to the previous standard, high efficiency video coding (HEVC). However, the encoder may still introduce visible coding artifacts, mainly caused by coding decisions applied to adjust the bitrate to the available bandwidth. Hence, pre and post-processing techniques are generally added to the coding pipeline to improve the quality of the decoded video. These methods have recently shown outstanding results compared to traditional approaches, thanks to the recent advances in deep learning. Generally, multiple neural networks are trained independently to perform different tasks, thus omitting to benefit from the redundancy that exists between the models. In this paper, we investigate a learning-based solution as a post-processing step to enhance the decoded VVC video quality. Our method relies on multitask learning to perform both quality enhancement and super-resolution using a single shared network optimized for multiple degradation levels. The proposed solution enables a good performance in both mitigating coding artifacts and super-resolution with fewer network parameters compared to traditional specialized architectures. △ Less

Submitted 3 May, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: accepted as a conference paper to Picture Coding Symposium (PCS) 2021

arXiv:2104.07930 [pdf, other]

Conditional Coding for Flexible Learned Video Compression

Authors: Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Déforges

Abstract: This paper introduces a novel framework for end-to-end learned video coding. Image compression is generalized through conditional coding to exploit information from reference frames, allowing to process intra and inter frames with the same coder. The system is trained through the minimization of a rate-distortion cost, with no pre-training or proxy loss. Its flexibility is assessed under three cod… ▽ More This paper introduces a novel framework for end-to-end learned video coding. Image compression is generalized through conditional coding to exploit information from reference frames, allowing to process intra and inter frames with the same coder. The system is trained through the minimization of a rate-distortion cost, with no pre-training or proxy loss. Its flexibility is assessed under three coding configurations (All Intra, Low-delay P and Random Access), where it is shown to achieve performance competitive with the state-of-the-art video codec HEVC. △ Less

Submitted 28 April, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Neural Compression Workshop @ ICLR 2021

Report number: hal-03192548

arXiv:2103.05354 [pdf, other]

Revisiting Model's Uncertainty and Confidences for Adversarial Example Detection

Authors: Ahmed Aldahdooh, Wassim Hamidouche, Olivier Déforges

Abstract: Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used fo… ▽ More Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used for AE detection but they showed limited success against black- and gray-box attacks. Moreover, the state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameters overhead, are time-consuming, or have latency in inference time. To trade off these factors, we revisit the model's uncertainty and confidences and propose a novel unsupervised ensemble AE detection mechanism that 1) uses the uncertainty method called SelectiveNet, 2) processes model layers outputs, i.e.feature maps, to generate new confidence probabilities. The detection method is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves better performance against black- and gray-box attacks than the state-of-the-art methods and achieves comparable performance against white-box attacks. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for CIFAR10 datasets. △ Less

Submitted 21 June, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: Under review

arXiv:2103.04203 [pdf, other]

Selective Encryption of the Versatile Video Coding Standard

Authors: Guillaume Gautier, Mousa FarajAllah, Wassim Hamidouche, Olivier Déforges, Safwan El Assad

Abstract: Versatile video coding (VVC) is the next generation video coding standard developed by the joint video experts team (JVET) and released in July 2020. VVC introduces several new coding tools providing a significant coding gain over the high efficiency video coding (HEVC) standard. It is well known that increasing the coding efficiency adds more dependencies in the video bitstream making format-comp… ▽ More Versatile video coding (VVC) is the next generation video coding standard developed by the joint video experts team (JVET) and released in July 2020. VVC introduces several new coding tools providing a significant coding gain over the high efficiency video coding (HEVC) standard. It is well known that increasing the coding efficiency adds more dependencies in the video bitstream making format-compliant encryption with the standard more challenging. In this paper we tackle the problem of selective encryption of the VVC standard in format-compliant and constant bitrate. These two constraints ensure that the encrypted bitstream can be decoded by any VVC decoder while the bitrate remains unchanged by the encryption. The selective encryption of all possible VVC syntax elements is investigated. A new algorithm is proposed to encrypt in format-compliant and constant bitrate the transform coefficients (TCs) together with other syntax elements at the level of the entropy encoder. The proposed solution was integrated and assessed under the VVC reference software model version 6.0. Experimental results showed that the encryption drastically decreases the video quality while the encryption is robust against several types of attacks. The encryption space is estimated in the range of 15% to 26% of the bitstream size resulting in a lightweight encryption process. The web page of this work is available at https://gugautie.github.io/sevvc/. △ Less

Submitted 6 March, 2021; originally announced March 2021.

arXiv:2103.04201 [pdf, other]

Light Field Image Coding Using VVC standard and View Synthesis based on Dual Discriminator GAN

Authors: Nader Bakir, Wassim Hamidouche, Sid Ahmed Fezza, Khouloud Samrouth, Olivier Deforges

Abstract: Light field (LF) technology is considered as a promising way for providing a high-quality virtual reality (VR) content. However, such an imaging technology produces a large amount of data requiring efficient LF image compression solutions. In this paper, we propose a LF image coding method based on a view synthesis and view quality enhancement techniques. Instead of transmitting all the LF views,… ▽ More Light field (LF) technology is considered as a promising way for providing a high-quality virtual reality (VR) content. However, such an imaging technology produces a large amount of data requiring efficient LF image compression solutions. In this paper, we propose a LF image coding method based on a view synthesis and view quality enhancement techniques. Instead of transmitting all the LF views, only a sparse set of reference views are encoded and transmitted, while the remaining views are synthesized at the decoder side. The transmitted views are encoded using the versatile video coding (VVC) standard and are used as reference views to synthesize the dropped views. The selection of non-reference dropped views is performed using a rate-distortion optimization based on the VVC temporal scalability. The dropped views are reconstructed using the LF dual discriminator GAN (LF-D2GAN) model. In addition, to ensure that the quality of the views is consistent, at the decoder, a quality enhancement procedure is performed on the reconstructed views allowing smooth navigation across views. Experimental results show that the proposed method provides high coding performance and overcomes the state-of-the-art LF image compression methods by -36.22% in terms of BD-BR and 1.35 dB in BD-PSNR. The web page of this work is available at https://naderbakir79.github.io/LFD2GAN.html. △ Less

Submitted 6 March, 2021; originally announced March 2021.

arXiv:2008.02580 [pdf, other]

Optical Flow and Mode Selection for Learning-based Video Coding

Authors: Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Déforges

Abstract: This paper introduces a new method for inter-frame coding based on two complementary autoencoders: MOFNet and CodecNet. MOFNet aims at computing and conveying the Optical Flow and a pixel-wise coding Mode selection. The optical flow is used to perform a prediction of the frame to code. The coding mode selection enables competition between direct copy of the prediction or transmission through Codec… ▽ More This paper introduces a new method for inter-frame coding based on two complementary autoencoders: MOFNet and CodecNet. MOFNet aims at computing and conveying the Optical Flow and a pixel-wise coding Mode selection. The optical flow is used to perform a prediction of the frame to code. The coding mode selection enables competition between direct copy of the prediction or transmission through CodecNet. The proposed coding scheme is assessed under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding conditions, where it is shown to perform on par with the state-of-the-art video codec ITU/MPEG HEVC. Moreover, the possibility of copying the prediction enables to learn the optical flow in an end-to-end fashion i.e. without relying on pre-training and/or a dedicated loss term. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Comments: MMSP 2020, IEEE 22nd International Workshop on Multimedia Signal Processing, Sep 2020, Tampere, Finland

arXiv:2007.02532 [pdf, other]

ModeNet: Mode Selection Network For Learned Video Coding

Authors: Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Déforges

Abstract: In this paper, a mode selection network (ModeNet) is proposed to enhance deep learning-based video compression. Inspired by traditional video coding, ModeNet purpose is to enable competition among several coding modes. The proposed ModeNet learns and conveys a pixel-wise partitioning of the frame, used to assign each pixel to the most suited coding mode. ModeNet is trained alongside the different… ▽ More In this paper, a mode selection network (ModeNet) is proposed to enhance deep learning-based video compression. Inspired by traditional video coding, ModeNet purpose is to enable competition among several coding modes. The proposed ModeNet learns and conveys a pixel-wise partitioning of the frame, used to assign each pixel to the most suited coding mode. ModeNet is trained alongside the different coding modes to minimize a rate-distortion cost. It is a flexible component which can be generalized to other systems to allow competition between different coding tools. Mod-eNet interest is studied on a P-frame coding task, where it is used to design a method for coding a frame given its prediction. ModeNet-based systems achieve compelling performance when evaluated under the Challenge on Learned Image Compression 2020 (CLIC20) P-frame coding track conditions. △ Less

Submitted 31 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Journal ref: Machine Learning for Signal Processing (MLSP) 2020, Sep 2020, Espoo, Finland

arXiv:2003.12322 [pdf, other]

Light Field Image Coding Using Dual Discriminator Generative Adversarial Network and VVC Temporal Scalability

Authors: Nader Bakir, Wassim Hamidouche, Sid Fezza, Khouloud Samrouth, Olivier Déforges

Abstract: Light field technology represents a viable path for providing a high-quality VR content. However, such an imaging system generates a high amount of data leading to an urgent need for LF image compression solution. In this paper, we propose an efficient LF image coding scheme based on view synthesis. Instead of transmitting all the LF views, only some of them are coded and transmitted, while the re… ▽ More Light field technology represents a viable path for providing a high-quality VR content. However, such an imaging system generates a high amount of data leading to an urgent need for LF image compression solution. In this paper, we propose an efficient LF image coding scheme based on view synthesis. Instead of transmitting all the LF views, only some of them are coded and transmitted, while the remaining views are dropped. The transmitted views are coded using Versatile Video Coding (VVC) and used as reference views to synthesize the missing views at decoder side. The dropped views are generated using the efficient dual discriminator GAN model. The selection of reference/dropped views is performed using a rate distortion optimization based on the VVC temporal scalability. Experimental results show that the proposed method provides high coding performance and overcomes the state-of-the-art LF image compression solutions. △ Less

Submitted 27 March, 2020; originally announced March 2020.

Comments: IEEE International Conference on Multimedia and Expo (ICME), Jul 2020, Londre, United Kingdom

arXiv:2002.09259 [pdf, other]

Binary Probability Model for Learning Based Image Compression

Authors: Théo Ladune, Pierrick Philippe, Wassim Hamidouche, Lu Zhang, Olivier Deforges

Abstract: In this paper, we propose to enhance learned image compression systems with a richer probability model for the latent variables. Previous works model the latents with a Gaussian or a Laplace distribution. Inspired by binary arithmetic coding , we propose to signal the latents with three binary values and one integer, with different probability models. A relaxation method is designed to perform gra… ▽ More In this paper, we propose to enhance learned image compression systems with a richer probability model for the latent variables. Previous works model the latents with a Gaussian or a Laplace distribution. Inspired by binary arithmetic coding , we propose to signal the latents with three binary values and one integer, with different probability models. A relaxation method is designed to perform gradient-based training. The richer probability model results in a better entropy coding leading to lower rate. Experiments under the Challenge on Learned Image Compression (CLIC) test conditions demonstrate that this method achieves 18% rate saving compared to Gaussian or Laplace models. △ Less

Submitted 21 February, 2020; originally announced February 2020.

Journal ref: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020, 2020

arXiv:2002.09196 [pdf, other]

Extending 2D Saliency Models for Head Movement Prediction in 360-degree Images using CNN-based Fusion

Authors: Ibrahim Djemai, Sid Fezza, Wassim Hamidouche, Olivier Deforges

Abstract: Saliency prediction can be of great benefit for 360-degree image/video applications, including compression, streaming , rendering and viewpoint guidance. It is therefore quite natural to adapt the 2D saliency prediction methods for 360-degree images. To achieve this, it is necessary to project the 360-degree image to 2D plane. However, the existing projection techniques introduce different distort… ▽ More Saliency prediction can be of great benefit for 360-degree image/video applications, including compression, streaming , rendering and viewpoint guidance. It is therefore quite natural to adapt the 2D saliency prediction methods for 360-degree images. To achieve this, it is necessary to project the 360-degree image to 2D plane. However, the existing projection techniques introduce different distortions, which provides poor results and makes inefficient the direct application of 2D saliency prediction models to 360-degree content. Consequently, in this paper, we propose a new framework for effectively applying any 2D saliency prediction method to 360-degree images. The proposed framework particularly includes a novel convolutional neural network based fusion approach that provides more accurate saliency prediction while avoiding the introduction of distortions. The proposed framework has been evaluated with five 2D saliency prediction methods, and the experimental results showed the superiority of our approach compared to the use of weighted sum or pixel-wise maximum fusion methods. △ Less

Submitted 21 February, 2020; originally announced February 2020.

Journal ref: IEEE International Symposium on Circuits and Systems (ISCAS), May 2020, Seville, Spain

arXiv:2002.07461 [pdf, other]

Lightweight hardware implementation of VVC transform block for ASIC decoder

Authors: I. Farhat, W. Hamidouche, A Grill, D. Ménard, O. Deforges

Abstract: Versatile Video Coding (VVC) is the next generation video coding standard expected by the end of 2020. Compared to its predecessor, VVC introduces new coding tools to make compression more efficient at the expense of higher computational complexity. This rises a need to design an efficient and optimised implementation especially for embedded platforms with limited memory and logic resources. One o… ▽ More Versatile Video Coding (VVC) is the next generation video coding standard expected by the end of 2020. Compared to its predecessor, VVC introduces new coding tools to make compression more efficient at the expense of higher computational complexity. This rises a need to design an efficient and optimised implementation especially for embedded platforms with limited memory and logic resources. One of the newly introduced tools in VVC is the Multiple Transform Selection (MTS). This latter involves three Discrete Cosine Transform (DCT)/Discrete Sine Transform (DST) types with larger and rectangular transform blocks. In this paper, an efficient hardware implementation of all DCT/DST transform types and sizes is proposed. The proposed design uses 32 multipliers in a pipelined architecture which targets an ASIC platform. It consists in a multi-standard architecture that supports the transform block of recent MPEG standards including AVC, HEVC and VVC. The architecture is optimized and removes unnecessary complexities found in other proposed architec-tures by using regular multipliers instead of multiple constant multi-pliers. The synthesized results show that the proposed method which sustain a constant throughput of two pixels/cycle and constant la-tency for all block sizes can reach an operational frequency of 600 Mhz enabling to decode in real-time 4K videos at 48 fps. △ Less

Submitted 18 February, 2020; originally announced February 2020.

Journal ref: International Conference on Acoustics, Speech, and Signal Processing, May 2020, Barcelone, Spain

arXiv:2002.06922 [pdf, other]

Versatile video coding and super-resolution for efficient delivery of 8K video with 4K backward-compatibility

Authors: Charles Bonnineau, Wassim Hamidouche, Jean-Francois Travers, Olivier Deforges

Abstract: In this paper, we propose, through an objective study, to compare and evaluate the performance of different coding approaches allowing the delivery of an 8K video signal with 4K backward-compatibility on broadcast networks. Presented approaches include simulcast of 8K and 4K single-layer signals encoded using High-Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards, spatial s… ▽ More In this paper, we propose, through an objective study, to compare and evaluate the performance of different coding approaches allowing the delivery of an 8K video signal with 4K backward-compatibility on broadcast networks. Presented approaches include simulcast of 8K and 4K single-layer signals encoded using High-Efficiency Video Coding (HEVC) and Versatile Video Coding (VVC) standards, spatial scalability using SHVC with 4K base layer (BL) and 8K enhancement-layer (EL), and super-resolution applied on 4K VVC signal after decoding to reach 8K resolution. For up-scaling, we selected the deep-learning-based super-resolution method called Super-Resolution with Feedback Network (SRFBN) and the Lanczos interpolation filter. We show that the deep-learning-based approach achieves visual quality gain over simulcast, especially on bit-rates lower than 30Mb/s with average gain of 0.77dB, 0.015, and 7.97 for PSNR, SSIM, and VMAF, respectively and out-performs the Lanczos filter in average by 29% of BD-rate savings. △ Less

Submitted 17 February, 2020; originally announced February 2020.

Journal ref: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2020, Barcelone, Spain

arXiv:2001.07960 [pdf, other]

A Fixation-based 360° Benchmark Dataset for Salient Object Detection

Authors: Yi Zhang, Lu Zhang, Wassim Hamidouche, Olivier Deforges

Abstract: Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications. However, another issue within the field of visual saliency, salient object detection (SOD), has been seldom explored in 360° (or omnidirectional) images due to the lack of datasets representative of real scenes with pixel-level annotations. Toward this end,… ▽ More Fixation prediction (FP) in panoramic contents has been widely investigated along with the booming trend of virtual reality (VR) applications. However, another issue within the field of visual saliency, salient object detection (SOD), has been seldom explored in 360° (or omnidirectional) images due to the lack of datasets representative of real scenes with pixel-level annotations. Toward this end, we collect 107 equirectangular panoramas with challenging scenes and multiple object classes. Based on the consistency between FP and explicit saliency judgements, we further manually annotate 1,165 salient objects over the collected images with precise masks under the guidance of real human eye fixation maps. Six state-of-the-art SOD models are then benchmarked on the proposed fixation-based 360° image dataset (F-360iSOD), by applying a multiple cubic projection-based fine-tuning method. Experimental results show a limitation of the current methods when used for SOD in panoramic images, which indicates the proposed dataset is challenging. Key issues for 360° SOD is also discussed. The proposed dataset is available at https://github.com/PanoAsh/F-360iSOD. △ Less

Submitted 19 May, 2020; v1 submitted 22 January, 2020; originally announced January 2020.

Comments: 5 pages, 5 figures, accepted by ICIP2020

arXiv:1911.07036 [pdf, other]

Quality Assessment of DIBR-synthesized views: An Overview

Authors: Shishun Tian, Lu Zhang, Wenbin Zou, Xia Li, Ting Su, Luce Morin, Olivier Deforges

Abstract: The Depth-Image-Based-Rendering (DIBR) is one of the main fundamental technique to generate new views in 3D video applications, such as Multi-View Videos (MVV), Free-Viewpoint Videos (FVV) and Virtual Reality (VR). However, the quality assessment of DIBR-synthesized views is quite different from the traditional 2D images/videos. In recent years, several efforts have been made towards this topic, b… ▽ More The Depth-Image-Based-Rendering (DIBR) is one of the main fundamental technique to generate new views in 3D video applications, such as Multi-View Videos (MVV), Free-Viewpoint Videos (FVV) and Virtual Reality (VR). However, the quality assessment of DIBR-synthesized views is quite different from the traditional 2D images/videos. In recent years, several efforts have been made towards this topic, but there {is a lack of} detailed survey in {the} literature. In this paper, we provide a comprehensive survey on various current approaches for DIBR-synthesized views. The current accessible datasets of DIBR-synthesized views are firstly reviewed{, followed} by a summary analysis of the representative state-of-the-art objective metrics. Then, the performances of different objective metrics are evaluated and discussed on all available datasets. Finally, we discuss the potential challenges and suggest possible directions for future research. △ Less

Submitted 27 April, 2021; v1 submitted 16 November, 2019; originally announced November 2019.

arXiv:1906.00204 [pdf, other]

Perceptual Evaluation of Adversarial Attacks for CNN-based Image Classification

Authors: Sid Ahmed Fezza, Yassine Bakhti, Wassim Hamidouche, Olivier Déforges

Abstract: Deep neural networks (DNNs) have recently achieved state-of-the-art performance and provide significant progress in many machine learning tasks, such as image classification, speech processing, natural language processing, etc. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. For instance, in the image classification domain, adding small imperceptible perturbatio… ▽ More Deep neural networks (DNNs) have recently achieved state-of-the-art performance and provide significant progress in many machine learning tasks, such as image classification, speech processing, natural language processing, etc. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. For instance, in the image classification domain, adding small imperceptible perturbations to the input image is sufficient to fool the DNN and to cause misclassification. The perturbed image, called \textit{adversarial example}, should be visually as close as possible to the original image. However, all the works proposed in the literature for generating adversarial examples have used the $L_{p}$ norms ($L_{0}$, $L_{2}$ and $L_{\infty}$) as distance metrics to quantify the similarity between the original image and the adversarial example. Nonetheless, the $L_{p}$ norms do not correlate with human judgment, making them not suitable to reliably assess the perceptual similarity/fidelity of adversarial examples. In this paper, we present a database for visual fidelity assessment of adversarial examples. We describe the creation of the database and evaluate the performance of fifteen state-of-the-art full-reference (FR) image fidelity assessment metrics that could substitute $L_{p}$ norms. The database as well as subjective scores are publicly available to help designing new metrics for adversarial examples and to facilitate future research works. △ Less

Submitted 1 June, 2019; originally announced June 2019.

Comments: Eleventh International Conference on Quality of Multimedia Experience (QoMEX 2019)

Showing 1–29 of 29 results for author: Deforges, O