
doi 10.1016/j.imavis.2024.105023

A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks

Authors: Ronglei Ji, A. Murat Tekalp

Abstract: Although real-world video deinterlacing and demosaicing are well-suited to supervised learning from synthetically degraded data because the degradation models are known and fixed, learned video deinterlacing and demosaicing have received much less attention than denoising and super-resolution tasks. We propose a new multi-picture architecture for video deinterlacing or demosaicing by aligning multiple supporting pictures with missing data to a reference picture to be reconstructed, benefiting from both local and global spatio-temporal correlations in the feature space using modified deformable convolution blocks and a novel residual efficient top-$k$ self-attention (kSA) block, respectively. Separate reconstruction blocks are used to estimate different types of missing data. Our extensive experimental results, on synthetic or real-world datasets, demonstrate that the proposed novel architecture provides superior results that significantly exceed the state-of-the-art for both tasks in terms of PSNR, SSIM, and perceptual quality. Ablation studies are provided to justify and show the benefit of each novel modification made to the deformable convolution and residual efficient kSA blocks. Code is available at https://github.com/KUIS-AI-Tekalp-Research-Group/Video-Deinterlacing.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 13 pages, 6 figures, accepted to IMAVIS
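
The abstract above relies on the deinterlacing degradation model being known and fixed, which is what makes synthetic training pairs easy to produce. As a minimal sketch of that idea (not the authors' data pipeline), the hypothetical `interlace` helper below turns progressive frames into fields by keeping alternate rows. The parity convention (even-indexed frames keep even rows) and the nested-list frame layout are assumptions made for illustration.

```python
def interlace(frames):
    """Turn progressive frames into fields by keeping alternate rows.

    Assumed convention: even-indexed frames keep even rows (top field),
    odd-indexed frames keep odd rows (bottom field).
    """
    fields = []
    for t, frame in enumerate(frames):
        parity = t % 2
        # Keep only the rows whose index matches this frame's parity.
        fields.append([row for y, row in enumerate(frame) if y % 2 == parity])
    return fields


# Two tiny 4x2 "frames"; each field keeps half of the rows.
frames = [
    [[10, 11], [12, 13], [14, 15], [16, 17]],
    [[20, 21], [22, 23], [24, 25], [26, 27]],
]
fields = interlace(frames)
```

A supervised training pair would then be (fields, original frames), with the discarded rows serving as the reconstruction target.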

arXiv:2404.11273

Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution

Authors: Cansu Korkmaz, A. Murat Tekalp

Abstract: Transformer-based models have achieved remarkable results in low-level vision tasks including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows encounter challenges in acquiring global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, has been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) We introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid transformer architecture in order to further enhance its receptive field. ii) We employ wavelet losses to train Transformer models to improve quantitative and subjective performance. While wavelet losses have been explored previously, showing their power in training Transformer-based SR models is novel. Our experimental results demonstrate that the proposed model provides state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.

Submitted 17 April, 2024; originally announced April 2024.

Comments: total of 10 pages including references, 5 tables and 5 figures, accepted for NTIRE 2024 Single Image Super Resolution (x4) challenge
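
To make the notion of a wavelet loss concrete, here is a minimal sketch that compares two grayscale images sub-band by sub-band after a one-level orthonormal Haar transform. It is not the loss used in the paper: the choice of Haar, the single decomposition level, the equal default weights, and the names `haar_dwt2` and `wavelet_l1_loss` are all assumptions for illustration. Sub-band naming conventions also vary between libraries.

```python
def haar_dwt2(img):
    """One-level 2D orthonormal Haar transform (even height and width).

    Returns a dict of four sub-bands, each half the size of the input.
    """
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, len(img), 2):
        rows = ([], [], [], [])
        for x in range(0, len(img[0]), 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            rows[0].append((a + b + c + d) / 2)  # local average
            rows[1].append((a + b - c - d) / 2)  # top rows minus bottom rows
            rows[2].append((a - b + c - d) / 2)  # left columns minus right columns
            rows[3].append((a - b - c + d) / 2)  # diagonal detail
        for band, row in zip((ll, lh, hl, hh), rows):
            band.append(row)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


def wavelet_l1_loss(sr, hr, weights=None):
    """Weighted sum of mean absolute differences over Haar sub-bands."""
    weights = weights or {"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}
    sr_bands, hr_bands = haar_dwt2(sr), haar_dwt2(hr)
    loss = 0.0
    for name, w in weights.items():
        diffs = [
            abs(s - h)
            for s_row, h_row in zip(sr_bands[name], hr_bands[name])
            for s, h in zip(s_row, h_row)
        ]
        loss += w * sum(diffs) / len(diffs)
    return loss


flat = [[1, 1], [1, 1]]
zero = [[0, 0], [0, 0]]
edge = [[1, 0], [1, 0]]
```

Raising the weights of the three detail bands relative to `LL` would penalize high-frequency errors more strongly than a plain pixel-wise L1 loss does.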

arXiv:2404.09790

NTIRE 2024 Challenge on Image Super-Resolution ($\times$4): Methods and Results

Authors: Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, et al. (63 additional authors not shown)

Abstract: This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.

Submitted 15 April, 2024; originally announced April 2024.

Comments: NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

arXiv:2403.11791

PAON: A New Neuron Model using Padé Approximants

Authors: Onur Keleş, A. Murat Tekalp

Abstract: Convolutional neural networks (CNN) are built upon the classical McCulloch-Pitts neuron model, which is essentially a linear model, where the nonlinearity is provided by a separate activation function. Several researchers have proposed enhanced neuron models, including quadratic neurons, generalized operational neurons, generative neurons, and super neurons, with stronger nonlinearity than that provided by the pointwise activation function. There has also been a proposal to use Padé approximation as a generalized activation function. In this paper, we introduce a brand new neuron model called Padé neurons (Paons), inspired by the Padé approximants, which are the best mathematical approximation of a transcendental function as a ratio of polynomials with different orders. We show that Paons are a superset of all other proposed neuron models. Hence, the basic neuron in any known CNN model can be replaced by Paons. In this paper, we extend the well-known ResNet to PadeNet (built by Paons) to demonstrate the concept. Our experiments on the single-image super-resolution task show that PadeNets can obtain better results than competing architectures.

Submitted 18 March, 2024; originally announced March 2024.

Comments: Submitted to IEEE ICIP 2024
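
As background for the Padé approximants that inspire the neuron model above, the sketch below evaluates the classical [2/2] Padé approximant of the exponential function and compares it with the degree-4 Taylor polynomial, which matches the same number of derivatives at zero. This illustrates the approximant only; it is not the Paon neuron itself, and the function names and the evaluation point `x = 0.5` are chosen purely for illustration.

```python
import math


def pade_exp_22(x):
    """[2/2] Pade approximant of exp(x): a ratio of two quadratics."""
    numerator = 1 + x / 2 + x * x / 12
    denominator = 1 - x / 2 + x * x / 12  # no real roots, so never zero
    return numerator / denominator


def taylor_exp_4(x):
    """Degree-4 Taylor polynomial of exp(x) around zero."""
    return sum(x ** n / math.factorial(n) for n in range(5))


x = 0.5
pade_err = abs(pade_exp_22(x) - math.exp(x))
taylor_err = abs(taylor_exp_4(x) - math.exp(x))
```

At this point the rational form has the smaller error of the two, which hints at why a ratio of polynomials can be a more expressive building block than a single polynomial of comparable order.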

arXiv:2402.19215

Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts

Authors: Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan

Abstract: Super-resolution (SR) is an ill-posed inverse problem, where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strikes a balance between fidelity and perceptual quality. Unfortunately, all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts, this is a very challenging problem and a satisfactory solution is yet to be found. This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before, they have not been used in the context of the SR task. More specifically, we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet sub-bands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves a better perception-distortion trade-off according to multiple objective measures and visual evaluations.

Submitted 29 February, 2024; originally announced February 2024.

Comments: Accepted for IEEE CVPR 2024, total of 11 pages, 3 pages for references, 7 figures and 2 tables

arXiv:2402.08862

Saliency-aware End-to-end Learned Variable-Bitrate 360-degree Image Compression

Authors: Oguzhan Gungordu, A. Murat Tekalp

Abstract: Effective compression of 360$^\circ$ images, also referred to as omnidirectional images (ODIs), is of high interest for various virtual reality (VR) and related applications. 2D image compression methods ignore the equator-biased nature of ODIs and fail to address oversampling near the poles, leading to inefficient compression when applied to ODIs. We present a new learned saliency-aware 360$^\circ$ image compression architecture that prioritizes bit allocation to more significant regions, considering the unique properties of ODIs. By assigning fewer bits to less important regions, significant data size reduction can be achieved while maintaining high visual quality in the significant regions. To the best of our knowledge, this is the first study that proposes an end-to-end variable-rate model to compress 360$^\circ$ images leveraging saliency information. The results show significant bit-rate savings over the state-of-the-art learned and traditional ODI compression methods at similar perceptual visual quality.

Submitted 13 February, 2024; originally announced February 2024.

Comments: 7 pages with double column, 1 and a half for references, 6 figures and 4 tables, submitted to IEEE ICIP 2024

arXiv:2402.08550

Motion-Adaptive Inference for Flexible Learned B-Frame Compression

Authors: M. Akin Yilmaz, O. Ugur Ulas, Ahmet Bilican, A. Murat Tekalp

Abstract: While the performance of recent learned intra and sequential video compression models exceeds that of respective traditional codecs, the performance of learned B-frame compression models generally lags behind traditional B-frame coding. The performance gap is bigger for complex scenes with large motions. This is related to the fact that the distance between the past and future references varies in hierarchical B-frame compression depending on the level of hierarchy, which causes the motion range to vary. The inability of a single B-frame compression model to adapt to various motion ranges causes loss of performance. As a remedy, we propose controlling the motion range for flow prediction during inference (to approximately match the range of motions in the training data) by downsampling video frames adaptively according to the amount of motion and the level of hierarchy in order to compress all B-frames using a single flexible-rate model. We present state-of-the-art BD rate results to demonstrate the superiority of our proposed single-model motion-adaptive inference approach to all existing learned B-frame compression models.

Submitted 13 February, 2024; originally announced February 2024.

Comments: 7 pages, submitted to IEEE ICIP 2024

arXiv:2402.07597

Trustworthy SR: Resolving Ambiguity in Image Super-resolution via Diffusion Models and Human Feedback

Authors: Cansu Korkmaz, Ege Cirakman, A. Murat Tekalp, Zafer Dogan

Abstract: Super-resolution (SR) is an ill-posed inverse problem with a large set of feasible solutions that are consistent with a given low-resolution image. Various deterministic algorithms aim to find a single solution that balances fidelity and perceptual quality; however, this trade-off often causes visual artifacts that bring ambiguity in information-centric applications. On the other hand, diffusion models (DMs) excel in generating a diverse set of feasible SR images that span the solution space. The challenge is then how to determine the most likely solution among this set in a trustworthy manner. We observe that quantitative measures, such as PSNR, LPIPS, and DISTS, are not reliable indicators to resolve ambiguous cases. To this effect, we propose employing human feedback, where we ask human subjects to select a small number of likely samples and we ensemble the averages of selected samples. This strategy leverages the high-quality image generation capabilities of DMs, while recognizing the importance of obtaining a single trustworthy solution, especially in use cases, such as identification of specific digits or letters, where generating multiple feasible solutions may not lead to a reliable outcome. Experimental results demonstrate that our proposed strategy provides more trustworthy solutions when compared to state-of-the-art SR methods.

Submitted 12 February, 2024; originally announced February 2024.

Comments: total of 7 pages with double column, 1 and a half for references, 6 figures and 2 tables, submitted to IEEE ICIP 2024
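
The ensembling step described in the abstract above can be pictured as a pixel-wise mean over the samples that human subjects picked. The sketch below shows only that averaging; how samples are generated and selected is outside its scope. The helper name `ensemble_mean`, the nested-list image layout, and the toy candidate values are assumptions for illustration, not the authors' implementation.

```python
def ensemble_mean(samples):
    """Pixel-wise mean of equally sized grayscale images (nested lists)."""
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    return [
        [sum(s[y][x] for s in samples) / n for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical 1x2 candidate SR outputs; the first two are "selected".
candidates = [[[0, 2]], [[2, 4]], [[9, 9]]]
selected = [0, 1]
fused = ensemble_mean([candidates[i] for i in selected])
```

Averaging only the selected samples keeps the unselected outlier (the third candidate here) from influencing the final image.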

arXiv:2307.01556

Spatio-Temporal Perception-Distortion Trade-off in Learned Video SR

Authors: Nasrin Rahimi, A. Murat Tekalp

Abstract: Perception-distortion trade-off is well-understood for single-image super-resolution. However, its extension to video super-resolution (VSR) is not straightforward, since popular perceptual measures only evaluate naturalness of spatial textures and do not take naturalness of flow (temporal coherence) into account. To this effect, we propose a new measure of spatio-temporal perceptual video quality emphasizing naturalness of optical flow via the perceptual straightness hypothesis (PSH) for meaningful spatio-temporal perception-distortion trade-off. We also propose a new architecture for perceptual VSR (PVSR) to explicitly enforce naturalness of flow to achieve realistic spatio-temporal perception-distortion trade-off according to the proposed measures. Experimental results with PVSR support the hypothesis that a meaningful perception-distortion trade-off for video should account for the naturalness of motion in addition to naturalness of texture.

Submitted 4 July, 2023; originally announced July 2023.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2023

arXiv:2306.16544

Multi-Scale Deformable Alignment and Content-Adaptive Inference for Flexible-Rate Bi-Directional Video Compression

Authors: M. Akın Yılmaz, O. Ugur Ulas, A. Murat Tekalp

Abstract: The lack of ability to adapt the motion compensation model to video content is an important limitation of current end-to-end learned video compression models. This paper advances the state-of-the-art by proposing an adaptive motion-compensation model for end-to-end rate-distortion optimized hierarchical bi-directional video compression. In particular, we propose two novelties: i) a multi-scale deformable alignment scheme at the feature level combined with multi-scale conditional coding, ii) motion-content adaptive inference. In addition, we employ a gain unit, which enables a single model to operate at multiple rate-distortion operating points. We also exploit the gain unit to control bit allocation among intra-coded vs. bi-directionally coded frames by fine-tuning corresponding models for truly flexible-rate learned video coding. Experimental results demonstrate state-of-the-art rate-distortion performance exceeding that of all prior art in learned video coding.

Submitted 28 June, 2023; originally announced June 2023.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2023

arXiv:2209.10192

doi 10.1109/ICIP46576.2022.9897353

Multi-Field De-interlacing using Deformable Convolution Residual Blocks and Self-Attention

Authors: Ronglei Ji, A. Murat Tekalp

Abstract: Although deep learning has made significant impact on image/video restoration and super-resolution, learned deinterlacing has so far received less attention in academia or industry. This is despite the fact that deinterlacing is well-suited for supervised learning from synthetic data since the degradation model is known and fixed. In this paper, we propose a novel multi-field full frame-rate deinterlacing network, which adapts the state-of-the-art super-resolution approaches to the deinterlacing task. Our model aligns features from adjacent fields to a reference field (to be deinterlaced) using both deformable convolution residual blocks and self-attention. Our extensive experimental results demonstrate that the proposed method provides state-of-the-art deinterlacing results in terms of both numerical and perceptual performance. At the time of writing, our model ranks first in the Full FrameRate LeaderBoard at https://videoprocessing.ai/benchmarks/deinterlacer.html

Submitted 21 September, 2022; originally announced September 2022.

Comments: 5 pages, 4 figures, accepted to ICIP 2022

arXiv:2209.08568

MMSR: Multiple-Model Learned Image Super-Resolution Benefiting From Class-Specific Image Priors

Authors: Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan

Abstract: Assuming a known degradation model, the performance of a learned image super-resolution (SR) model depends on how well the variety of image characteristics within the training set matches those in the test set. As a result, the performance of an SR model varies noticeably from image to image over a test set depending on whether characteristics of specific images are similar to those in the training set or not. Hence, in general, a single SR model cannot generalize well enough for all types of image content. In this work, we show that training multiple SR models for different classes of images (e.g., for text, texture, etc.) to exploit class-specific image priors and employing a post-processing network that learns how to best fuse the outputs produced by these multiple SR models surpasses the performance of state-of-the-art generic SR models. Experimental results clearly demonstrate that the proposed multiple-model SR (MMSR) approach significantly outperforms a single pre-trained state-of-the-art SR model both quantitatively and visually. It even exceeds the performance of the best single class-specific SR model trained on similar text or texture images.

Submitted 18 September, 2022; originally announced September 2022.

Comments: 5 pages, 4 figures, accepted for publication in IEEE ICIP 2022 Conference

arXiv:2209.08564

Perception-Distortion Trade-off in the SR Space Spanned by Flow Models

Authors: Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan, Erkut Erdem, Aykut Erdem

Abstract: Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. Diversity of SR solutions increases with the temperature ($τ$) of latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fusion approach to obtain a single SR image eliminating random artifacts and improving fidelity without significantly compromising perceptual quality. We achieve this by benefiting from a diverse set of feasible photo-realistic solutions in the SR space spanned by flow models. We propose different image ensembling and fusion strategies which offer multiple paths to move sample solutions in the SR space to more desired destinations in the perception-distortion plane in a controllable manner depending on the fidelity vs. perceptual quality requirements of the task at hand. Experimental results demonstrate that our image ensembling/fusion strategy achieves a more promising perception-distortion trade-off compared to sample SR images produced by flow models and adversarially trained models in terms of both quantitative metrics and visual quality.

Submitted 18 September, 2022; originally announced September 2022.

Comments: 5 pages, 4 figures, accepted for publication in IEEE ICIP 2022 Conference

arXiv:2206.13613

Flexible-Rate Learned Hierarchical Bi-Directional Video Compression With Motion Refinement and Frame-Level Bit Allocation

Authors: Eren Cetin, M. Akin Yilmaz, A. Murat Tekalp

Abstract: This paper presents improvements and novel additions to our recent work on end-to-end optimized hierarchical bi-directional video compression to further advance the state-of-the-art in learned video compression. As an improvement, we combine motion estimation and prediction modules and compress refined residual motion vectors for improved rate-distortion performance. As a novel addition, we adapt the gain unit proposed for image compression to flexible-rate video compression in two ways: first, the gain unit enables a single encoder model to operate at multiple rate-distortion operating points; second, we exploit the gain unit to control bit allocation among intra-coded vs. bi-directionally coded frames by fine-tuning corresponding models for truly flexible-rate learned video coding. Experimental results demonstrate that we obtain state-of-the-art rate-distortion performance exceeding that of all prior art in learned video coding.

Submitted 27 June, 2022; originally announced June 2022.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP 2022)

Report number: 1850
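
The gain unit mentioned in the abstract above lets one model serve several rate-distortion operating points. A rough way to picture the mechanism is channel-wise scaling of the latent before rounding, with the scaling undone after decoding: a larger gain gives finer quantization (more distinct symbols, lower error). The sketch below illustrates that idea only. The helper names, the toy latent, and the use of a plain reciprocal as the inverse gain are assumptions, not the formulation used in the paper.

```python
def encode(latent, gain):
    """Scale each latent channel by its gain, then round to integer symbols."""
    return [round(v * g) for v, g in zip(latent, gain)]


def decode(symbols, gain):
    """Undo the scaling; a plain reciprocal is assumed as the inverse gain."""
    return [s / g for s, g in zip(symbols, gain)]


def max_error(reference, reconstruction):
    return max(abs(a - b) for a, b in zip(reference, reconstruction))


latent = [0.3, -1.2, 2.6]
low_rate_gain = [1.0, 1.0, 1.0]   # coarse quantization, fewer distinct symbols
high_rate_gain = [4.0, 4.0, 4.0]  # fine quantization, more distinct symbols

coarse_err = max_error(latent, decode(encode(latent, low_rate_gain), low_rate_gain))
fine_err = max_error(latent, decode(encode(latent, high_rate_gain), high_rate_gain))
```

Switching between gain vectors at inference time changes the operating point without retraining the encoder, which is the property the abstract attributes to the gain unit.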

arXiv:2112.09529

End-to-End Rate-Distortion Optimized Learned Hierarchical Bi-Directional Video Compression

Authors: M. Akın Yılmaz, A. Murat Tekalp

Abstract: Conventional video compression (VC) methods are based on motion compensated transform coding, and the steps of motion estimation, mode and quantization parameter selection, and entropy coding are optimized individually due to the combinatorial nature of the end-to-end optimization problem. Learned VC allows end-to-end rate-distortion (R-D) optimized training of nonlinear transform, motion and entropy model simultaneously. Most works on learned VC consider end-to-end optimization of a sequential video codec based on R-D loss averaged over pairs of successive frames. It is well-known in conventional VC that hierarchical, bi-directional coding outperforms sequential compression because of its ability to use both past and future reference frames. This paper proposes a learned hierarchical bi-directional video codec (LHBDC) that combines the benefits of hierarchical motion-compensated prediction and end-to-end optimization. Experimental results show that we achieve the best R-D results that are reported for learned VC schemes to date in both PSNR and MS-SSIM. Compared to conventional video codecs, the R-D performance of our end-to-end optimized codec exceeds that of both x265 and SVT-HEVC encoders ("veryslow" preset) in PSNR and MS-SSIM as well as HM 16.23 reference software in MS-SSIM. We present ablation studies showing performance gains due to proposed novel tools such as learned masking, flow-field subsampling, and temporal flow vector prediction. The models and instructions to reproduce our results can be found at https://github.com/makinyilmaz/LHBDC/

Submitted 17 December, 2021; originally announced December 2021.

Comments: Accepted for publication in IEEE Transactions on Image Processing on 15 Dec. 2021

arXiv:2106.00504

Two-stage domain adapted training for better generalization in real-world image restoration and super-resolution

Authors: Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan

Abstract: It is well-known that in inverse problems, end-to-end trained networks overfit the degradation model seen in the training set, i.e., they do not generalize well to other types of degradations. Recently, an approach to first map images downsampled by unknown filters to bicubically downsampled look-alike images was proposed to successfully super-resolve such images. In this paper, we show that any inverse problem can be formulated by first mapping the input degraded images to an intermediate domain, and then training a second network to form output images from these intermediate images. Furthermore, the best intermediate domain may vary according to the task. Our experimental results demonstrate that this two-stage domain-adapted training strategy not only achieves better results on a given class of unknown degradations but also generalizes better to other unseen classes of degradations.

Submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted for publication in IEEE ICIP 2021 Conference

arXiv:2105.14926

Self-Organized Residual Blocks for Image Super-Resolution

Authors: Onur Keleş, A. Murat Tekalp, Junaid Malik, Serkan Kıranyaz

Abstract: It has become standard practice to use convolutional networks (ConvNet) with ReLU non-linearity in image restoration and super-resolution (SR). Although the universal approximation theorem states that a multi-layer neural network can approximate any non-linear function with the desired precision, it does not reveal the best network architecture to do so. Recently, operational neural networks (ONNs) that choose the best non-linearity from a set of alternatives, and their "self-organized" variants (Self-ONN) that approximate any non-linearity via Taylor series have been proposed to address the well-known limitations and drawbacks of conventional ConvNets such as network homogeneity using only the McCulloch-Pitts neuron model. In this paper, we propose the concept of self-organized operational residual (SOR) blocks, and present hybrid network architectures combining regular residual and SOR blocks to strike a balance between the benefits of stronger non-linearity and the overall number of parameters. The experimental results demonstrate that the proposed architectures yield performance improvements in both PSNR and perceptual metrics.

Submitted 31 May, 2021; originally announced May 2021.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2105.12794 [pdf, other]

DFPN: Deformable Frame Prediction Network

Authors: M. Akın Yılmaz, A. Murat Tekalp

Abstract: Learned frame prediction is a current problem of interest in computer vision and video compression. Although several deep network architectures have been proposed for learned frame prediction, to the best of our knowledge, there is no work based on using deformable convolutions for frame prediction. To this effect, we propose a deformable frame prediction network (DFPN) for task oriented implicit motion modeling and next frame prediction. Experimental results demonstrate that the proposed DFPN model achieves state of the art results in next frame prediction. Our models and results are available at https://github.com/makinyilmaz/DFPN.

Submitted 26 May, 2021; originally announced May 2021.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2105.12107 [pdf, other]

Self-Organized Variational Autoencoders (Self-VAE) for Learned Image Compression

Authors: M. Akın Yılmaz, Onur Keleş, Hilal Güven, A. Murat Tekalp, Junaid Malik, Serkan Kıranyaz

Abstract: In end-to-end optimized learned image compression, it is standard practice to use a convolutional variational autoencoder with generalized divisive normalization (GDN) to transform images into a latent space. Recently, Operational Neural Networks (ONNs) that learn the best non-linearity from a set of alternatives, and their self-organized variants, Self-ONNs, that approximate any non-linearity via Taylor series have been proposed to address the limitations of convolutional layers and a fixed nonlinear activation. In this paper, we propose to replace the convolutional and GDN layers in the variational autoencoder with self-organized operational layers, and propose a novel self-organized variational autoencoder (Self-VAE) architecture that benefits from stronger non-linearity. The experimental results demonstrate that the proposed Self-VAE yields improvements in both rate-distortion performance and perceptual image quality.

Submitted 28 May, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

Comments: Accepted for publication in IEEE International Conference on Image Processing (ICIP) 2021

arXiv:2104.14868 [pdf, other]

On the Computation of PSNR for a Set of Images or Video

Authors: Onur Keleş, M. Akın Yılmaz, A. Murat Tekalp, Cansu Korkmaz, Zafer Dogan

Abstract: When comparing learned image/video restoration and compression methods, it is common to report peak signal-to-noise ratio (PSNR) results. However, there does not exist a generally agreed upon practice to compute PSNR for sets of images or video. Some authors report the average of individual image/frame PSNR, which is equivalent to computing a single PSNR from the geometric mean of individual image/frame mean-square error (MSE). Others compute a single PSNR from the arithmetic mean of frame MSEs for each video. Furthermore, some compute the MSE/PSNR of the Y-channel only, while others compute MSE/PSNR for RGB channels. This paper investigates different approaches to computing PSNR for sets of images, single video, and sets of video and the relation between them. We show that the difference between computing the PSNR based on the arithmetic vs. geometric mean of MSE depends on the distribution of MSE over the set of images or video, and that this distribution is task-dependent. In particular, these two methods yield larger differences in restoration problems, where the MSE is exponentially distributed, and smaller differences in compression problems, where the MSE distribution is narrower. We hope this paper will motivate the community to clearly describe how they compute reported PSNR values to enable consistent comparison.

Submitted 30 April, 2021; originally announced April 2021.

Comments: accepted for publication in Picture Coding Symposium (PCS) 2021
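
The equivalence stated in the abstract above (averaging per-frame PSNR equals one PSNR computed from the geometric mean of per-frame MSE) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names, the 8-bit peak value of 255, and the sample MSE values are assumptions made for illustration, and every MSE is assumed to be strictly positive.

```python
import math


def psnr_from_mse(mse, peak=255.0):
    """PSNR in dB for a single MSE value (assumes mse > 0, 8-bit peak by default)."""
    return 10.0 * math.log10(peak * peak / mse)


def average_of_psnrs(mses, peak=255.0):
    """Average of the individual per-frame PSNR values."""
    return sum(psnr_from_mse(m, peak) for m in mses) / len(mses)


def psnr_of_arithmetic_mean_mse(mses, peak=255.0):
    """Single PSNR computed from the arithmetic mean of the per-frame MSEs."""
    return psnr_from_mse(sum(mses) / len(mses), peak)


def psnr_of_geometric_mean_mse(mses, peak=255.0):
    """Single PSNR computed from the geometric mean of the per-frame MSEs."""
    log_mean = sum(math.log(m) for m in mses) / len(mses)
    return psnr_from_mse(math.exp(log_mean), peak)


# Hypothetical per-frame MSEs with a wide spread, chosen only for illustration.
frame_mses = [1.0, 100.0]
avg_psnr = average_of_psnrs(frame_mses)
geo_psnr = psnr_of_geometric_mean_mse(frame_mses)
arith_psnr = psnr_of_arithmetic_mean_mse(frame_mses)
```

Because the arithmetic mean of positive numbers is never smaller than their geometric mean, `arith_psnr` can never exceed `avg_psnr`, and the gap grows as the per-frame MSE values spread out.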

arXiv:2104.14836 [pdf, ps, other]

A Practical Approach for Rate-Distortion-Perception Analysis in Learned Image Compression

Authors: Ogun Kirmemis, A. Murat Tekalp

Abstract: Rate-distortion optimization (RDO) of codecs, where distortion is quantified by the mean-square error, has been a standard practice in image/video compression over the years. RDO serves well for optimization of codec performance for evaluation of the results in terms of PSNR. However, it is well known that the PSNR does not correlate well with perceptual evaluation of images; hence, RDO is not well suited for perceptual optimization of codecs. Recently, rate-distortion-perception trade-off has been formalized by taking the Kullback-Leibler (KL) divergence between the distributions of the original and reconstructed images as a perception measure. Learned image compression methods that simultaneously optimize rate, mean-square loss, VGG loss, and an adversarial loss were proposed. Yet, there exists no easy approach to fix the rate, distortion or perception at a desired level in a practical learned image compression solution to perform an analysis of the trade-off between rate, distortion and perception measures. In this paper, we propose a practical approach to fix the rate to carry out perception-distortion analysis at a fixed rate in order to perform perceptual evaluation of image compression results in a principled manner. Experimental results provide several insights for practical rate-distortion-perception analysis in learned image compression.

Submitted 30 April, 2021; originally announced April 2021.

Comments: accepted for publication in Picture Coding Symposium (PCS) 2021

arXiv:2102.06531 [pdf, ps, other]

Editorial: Introduction to the Issue on Deep Learning for Image/Video Restoration and Compression

Authors: A. Murat Tekalp, Michele Covell, Radu Timofte, Chao Dong

Abstract: Recent works have shown that learned models can achieve significant performance gains, especially in terms of perceptual quality measures, over traditional methods. Hence, the state of the art in image restoration and compression is getting redefined. This special issue covers the state of the art in learned image/video restoration and compression to promote further progress in innovative architectures and training methods for effective and efficient networks for image/video restoration and compression.

Submitted 9 February, 2021; originally announced February 2021.

Journal ref: IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, vol. 15, no. 2, FEBRUARY 2021

arXiv:2008.06106 [pdf, other]

doi 10.1109/ICIP.2019.8803624

Effect of Architectures and Training Methods on the Performance of Learned Video Frame Prediction

Authors: M. Akin Yilmaz, A. Murat Tekalp

Abstract: We analyze the performance of feedforward vs. recurrent neural network (RNN) architectures and associated training methods for learned frame prediction. To this effect, we trained a residual fully convolutional neural network (FCNN), a convolutional RNN (CRNN), and a convolutional long short-term memory (CLSTM) network for next frame prediction using the mean square loss. We performed both stateless and stateful training for recurrent networks. Experimental results show that the residual FCNN architecture performs the best in terms of peak signal-to-noise ratio (PSNR) at the expense of higher training and test (inference) computational complexity. The CRNN can be trained stably and very efficiently using the stateful truncated backpropagation through time procedure, and it requires an order of magnitude less inference runtime to achieve near real-time frame prediction with an acceptable performance.

Submitted 13 August, 2020; originally announced August 2020.

Comments: Accepted for publication at IEEE ICIP 2019

arXiv:2008.05028 [pdf, other]

doi 10.1109/ICIP40778.2020.9190881

End-to-End Rate-Distortion Optimization for Bi-Directional Learned Video Compression

Authors: M. Akin Yilmaz, A. Murat Tekalp

Abstract: Conventional video compression methods employ a linear transform and block motion model, and the steps of motion estimation, mode and quantization parameter selection, and entropy coding are optimized individually due to the combinatorial nature of the end-to-end optimization problem. Learned video compression allows end-to-end rate-distortion optimized training of all nonlinear modules, quantization parameter and entropy model simultaneously. While previous work on learned video compression considered training a sequential video codec based on end-to-end optimization of cost averaged over pairs of successive frames, it is well-known in conventional video compression that hierarchical, bi-directional coding outperforms sequential compression. In this paper, we propose for the first time end-to-end optimization of a hierarchical, bi-directional motion compensated learned codec by accumulating the cost function over fixed-size groups of pictures (GOP). Experimental results show that the rate-distortion performance of our proposed learned bi-directional GOP coder outperforms the state-of-the-art end-to-end optimized learned sequential compression as expected.

Submitted 26 May, 2021; v1 submitted 11 August, 2020; originally announced August 2020.

Comments: This work is accepted for publication in IEEE ICIP 2020
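
To make "hierarchical, bi-directional coding" over a fixed-size GOP concrete, the sketch below lists one possible coding order. The abstract above does not specify the GOP structure, so the dyadic split, the choice to code the last frame of the GOP from the first, and the name `hierarchical_order` are all assumptions for illustration, not the authors' design.

```python
def hierarchical_order(gop_size):
    """Return (frame, past_ref, future_ref) tuples in coding order.

    Assumed structure: frame 0 is coded without references, frame
    gop_size is predicted from frame 0, and every remaining frame is
    predicted bi-directionally from the two already coded frames that
    bracket it, using a dyadic (midpoint) split.
    """
    order = [(0, None, None), (gop_size, 0, None)]

    def split(lo, hi):
        # Stop when there is no frame strictly between lo and hi.
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append((mid, lo, hi))  # mid uses lo (past) and hi (future)
        split(lo, mid)
        split(mid, hi)

    split(0, gop_size)
    return order


# Example: coding order for a hypothetical GOP spanning frames 0..8.
order_8 = hierarchical_order(8)
```

A training loss accumulated over the GOP would then sum the per-frame rate-distortion cost over every tuple in this list, rather than over pairs of successive frames.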

arXiv:2007.08922 [pdf, other]

doi 10.1007/s11760-020-01751-y

Can Learned Frame-Prediction Compete with Block-Motion Compensation for Video Coding?

Authors: Serkan Sulun, A. Murat Tekalp

Abstract: Given recent advances in learned video prediction, we investigate whether a simple video codec using a pre-trained deep model for next frame prediction based on previously encoded/decoded frames without sending any motion side information can compete with standard video codecs based on block-motion compensation. Frame differences given learned frame predictions are encoded by a standard still-image (intra) codec. Experimental results show that the rate-distortion performance of the simple codec with symmetric complexity is on average better than that of x264 codec on 10 MPEG test videos, but does not yet reach the level of x265 codec. This result demonstrates the power of learned frame prediction (LFP), since unlike motion compensation, LFP does not use information from the current picture. The implications of training with L1, L2, or combined L2 and adversarial loss on prediction performance and compression efficiency are analyzed.

Submitted 17 July, 2020; originally announced July 2020.

Comments: Accepted for publication in Springer Journal of Signal, Image and Video Processing

arXiv:2002.05922 [pdf]

doi 10.1109/TIP.2020.2972112

Realizing a Low-Power Head-Mounted Phase-Only Holographic Display by Light-Weight Compression

Authors: Burak Soner, Erdem Ulusoy, A. Murat Tekalp, Hakan Urey

Abstract: Head-mounted holographic displays (HMHD) are projected to be the first commercial realization of holographic video display systems. HMHDs use liquid crystal on silicon (LCoS) spatial light modulators (SLM), which are best suited to display phase-only holograms (POH). The performance/watt requirement of a monochrome, 60 fps Full HD, 2-eye, POH HMHD system is about 10 TFLOPS/W, which is orders of magnitude higher than what is achievable by commercially available mobile processors. To mitigate this compute power constraint, display-ready POHs shall be generated on a nearby server and sent to the HMHD in compressed form over a wireless link. This paper discusses design of a feasible HMHD-based augmented reality system, focusing on compression requirements and per-pixel rate-distortion trade-off for transmission of display-ready POH from the server to HMHD. Since the decoder in the HMHD needs to operate on low power, only coding methods that have low-power decoder implementation are considered. Effects of 2D phase unwrapping and flat quantization on compression performance are also reported. We next propose a versatile PCM-POH codec with progressive quantization that can adapt to SLM-dynamic-range and available bitrate, and features per-pixel rate-distortion control to achieve acceptable POH quality at target rates of 60-200 Mbit/s that can be reliably achieved by current wireless technologies. Our results demonstrate feasibility of realizing a low-power, quality-ensured, multi-user, interactive HMHD augmented reality system with commercially available components using the proposed adaptive compression of display-ready POH with light-weight decoding.

Submitted 14 February, 2020; originally announced February 2020.

Comments: 10 pages, 6 figures, accepted for publication in the IEEE Transactions on Image Processing

Journal ref: IEEE Transactions on Image Processing, vol. 29, pp. 4505-4515, 2020
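
For a sense of scale, the 60-200 Mbit/s target rates quoted in the abstract above can be converted into an average bit budget per pixel. This is a back-of-the-envelope sketch under stated assumptions, not a figure from the paper: Full HD is taken as 1920x1080, both eyes are assumed to be sent at full resolution, one Mbit is taken as 10^6 bits, and the function name is hypothetical.

```python
def bits_per_pixel(bitrate_mbps, width=1920, height=1080, fps=60, eyes=2):
    """Average bits available per monochrome pixel at a given link rate."""
    pixels_per_second = width * height * fps * eyes
    return bitrate_mbps * 1e6 / pixels_per_second


# Budget at the two ends of the target range quoted in the abstract.
low_rate_bpp = bits_per_pixel(60)    # roughly 0.24 bits per pixel
high_rate_bpp = bits_per_pixel(200)  # roughly 0.80 bits per pixel
```

Under these assumptions the link leaves well under one bit per pixel on average across the whole target range.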

arXiv:1806.00333 [pdf, other]

Learned Compression Artifact Removal by Deep Residual Networks

Authors: Ogün Kırmemiş, Gonca Bakar, A. Murat Tekalp

Abstract: We propose a method for learned compression artifact removal by post-processing of BPG compressed images. We trained three networks of different sizes. We encoded input images using BPG with different QP values. We submitted the best combination of test images, encoded with different QP and post-processed by one of three networks, which satisfy the file size and decode time constraints imposed by the Challenge. The selection of the best combination is posed as an integer programming problem. Although the visual improvements in image quality are impressive, the average PSNR improvement for the results is about 0.5 dB.

Submitted 1 June, 2018; originally announced June 2018.

Comments: Accepted for publication in the CVPR 2018, Challenge on Learned Image Compression (CLIC), Salt Lake City, Utah, USA, 18 June 2018 and appears in compression.cc
