Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Omar Elharrouss, Rafat Damseh, Abdelkader Nasreddine Belkacem, Elarbi Badidi, Abderrahmane Lakas
Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirates University.
Department of Computer and Network Engineering, College of Information Technology, United Arab Emirates University.
Abstract

Image inpainting is an active research topic in computer vision. It offers a viable solution for various applications, including photographic restoration, video editing, and medical imaging. Advances in deep learning, notably convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly improved the ability to fill missing or damaged regions in an image or video with contextually appropriate details. These advances have also improved efficiency, information preservation, and the realism of both textures and structures. More recently, transformer-based architectures, which were initially designed for natural language processing, have been adopted in computer vision and applied to image or video inpainting. These methods use self-attention mechanisms that excel at capturing long-range dependencies within data; they are therefore particularly effective for tasks that require a comprehensive understanding of the global context of an image or video. In this paper, we provide a comprehensive review of current image or video inpainting approaches, with a specific focus on transformer-based techniques, with the goal of highlighting the significant improvements and providing a guideline for new researchers in the field. We categorize the transformer-based techniques by their architectural configurations, types of damage, and performance metrics. Furthermore, we present an organized synthesis of the current challenges and suggest directions for future research in the field of image or video inpainting.

Keywords: Image inpainting, Video inpainting, Visual transformers, Review

1 Introduction

Image inpainting is a fundamental task in computer vision and image processing that involves the restoration or completion of missing or damaged regions within an image. It offers a practical solution for many tasks, including photograph restoration, video editing, and medical imaging. Over the years, different techniques have been developed to address image or video inpainting, ranging from traditional patch-based or exemplar-based approaches to recent deep learning-based techniques [1]. Patch-based image inpainting searches for patches (small regions) within the image to fill in the missing or damaged parts [2]. The algorithm looks for the best-matching patches that align with the boundary conditions of the damaged area and uses these patches to reconstruct the missing parts. It is particularly effective for small damaged regions. The exemplar-based technique is an extension of the patch-based approach that incorporates additional information, such as texture and structure, into the selection process [3]. These methods prioritize patches whose structural or textural information is similar to that of the surrounding region, which makes the inpainting results more coherent with the overall image content. More recently, convolutional neural networks (CNNs) trained on large-scale datasets have revolutionized the field of image inpainting [4, 5, 6, 7]. The learning capabilities of these networks are used to capture the contextual information needed to fill the missing parts of the image while preserving its coherence and quality. These approaches have demonstrated remarkable performance improvements over traditional methods, particularly in handling complex and large-scale inpainting tasks.
In addition to CNN-based models, advanced techniques such as generative adversarial networks (GANs) have been employed to generate high-quality inpainted images while maintaining the structural and textural integrity of the original image. These techniques are also effective at handling complex textures and large missing regions. Furthermore, transformer-based architectures, initially proposed for natural language processing tasks, have gained significant interest in computer vision, including image inpainting [8]. Transformers with self-attention mechanisms excel at capturing long-range dependencies, making them suitable for tasks that require an understanding of global context, such as image inpainting [9]. Transformer-based inpainting methods capture contextual information from the entire image to produce accurate and coherent inpainting results. In this paper, we aim to provide a comprehensive overview of transformer-based image or video inpainting methods, organized by category of approach, architecture, and performance characteristics.
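The patch-based search described earlier in this section can be illustrated with a minimal sketch. It is not the algorithm of [2] or [3]: it assumes a grayscale image stored as a list of lists, a binary mask in which 1 marks a missing pixel, and a sum-of-squared-differences score computed over known pixels only. The helper names `patch`, `best_patch`, and `fill_patch` are hypothetical.

```python
def patch(img, r, c, k):
    """Return the k x k patch whose top-left corner is (r, c)."""
    return [row[c:c + k] for row in img[r:r + k]]


def best_patch(img, mask, r0, c0, k):
    """Find the fully known k x k patch that best matches the known
    pixels of the damaged patch at (r0, c0), using SSD as the score."""
    target = patch(img, r0, c0, k)
    target_mask = patch(mask, r0, c0, k)
    best, best_cost = None, float("inf")
    h, w = len(img), len(img[0])
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            cand_mask = patch(mask, r, c, k)
            if any(m for row in cand_mask for m in row):
                continue  # a candidate must contain no missing pixels
            cand = patch(img, r, c, k)
            cost = sum(
                (cand[i][j] - target[i][j]) ** 2
                for i in range(k) for j in range(k)
                if target_mask[i][j] == 0  # compare known pixels only
            )
            if cost < best_cost:
                best, best_cost = (r, c), cost
    return best


def fill_patch(img, mask, r0, c0, k):
    """Copy the best-matching source pixels into the missing pixels."""
    src = best_patch(img, mask, r0, c0, k)
    out = [row[:] for row in img]
    if src is None:
        return out
    for i in range(k):
        for j in range(k):
            if mask[r0 + i][c0 + j]:
                out[r0 + i][c0 + j] = img[src[0] + i][src[1] + j]
    return out


# Toy example: a vertical-stripe texture with one missing pixel.
image = [[0, 9, 0, 9, 0, 9] for _ in range(4)]
mask = [[0] * 6 for _ in range(4)]
image[1][3], mask[1][3] = 0, 1  # damaged pixel (true value 9)
restored = fill_patch(image, mask, 1, 2, 2)
```

The exhaustive search keeps the sketch short; practical patch-based methods typically restrict or accelerate the search, and exemplar-based variants additionally decide the order in which damaged patches are filled.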

We present a selection of influential and essential algorithms from prestigious journals and conferences. The focus of this study is on contemporary transformer-based image or video inpainting methodologies, which can provide a deeper understanding of the advancements in image or video inpainting. Additionally, a discussion of recent advancements, challenges, and future research directions in the field of image or video inpainting using transformer-based methodologies is provided. The content of this review is presented as follows:

  • Summarization of the existing surveys.

  • A description of different types of damage.

  • Classification of transformer-based image or video inpainting methods.

  • Public datasets deployed to evaluate the image or video inpainting methods.

  • A description of the evaluation metrics, with comparisons of the most significant works.

  • Current challenges and future directions.

The remainder of the paper is organized as follows. An overview of related studies, the scope, and previous surveys is provided in Section 2. A taxonomy of image and video inpainting is presented in Section 3. Video inpainting methods are discussed in Section 4. The loss functions used are presented in Section 5. Public datasets are briefly described in Section 6, followed by the evaluation metrics and comparisons of the most significant image or video inpainting methods in Section 7. Current challenges are presented in Section 8. Finally, a conclusion is provided in Section 9.

2 Related previous reviews and surveys

Several reviews and surveys of image or video inpainting, covering different techniques and purposes, have been published [10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22].

These reviews can be categorized based on the data used in inpainting, such as scratched pictures [10], depth images using 3D-to-2D representations [11, 12], RGB images [13, 14], or forensic imaging [17, 18]. In addition, the technique-based reviews can be divided into two categories: traditional methods and deep learning-based approaches. The earliest reviews, [10], [11], and [12], cover traditional techniques for image or video inpainting, including texture- and patch-based techniques. In some papers, the authors briefly discuss the traditional methods and then the deep learning methods, as in [19]. Among the deep learning-based reviews, most of which were published during the last five years, [13, 15, 20, 22] cover CNN-based methods, while [13, 14, 19, 21] focus on GANs. A summary of these surveys is provided in Table 1.

Table 1: Summary of existing image inpainting reviews and surveys.
| Reference | Year | Subject | Technique | Brief description |
|---|---|---|---|---|
| [10] | 2015 | Old pictures | Patch and texture-based | The paper presents techniques that use neighboring pixels and texture to convincingly restore missing or corrupted regions in old pictures. The various methods are discussed with their pros and cons, together with a comparative review of these techniques. |
| [11] | 2016 | Depth map inpainting | Structure and texture-based | The paper discusses two main types of inpainting methods: 2D image inpainting and depth map inpainting. 2D inpainting restores corrupted parts using structural or textural methods, while depth map inpainting enhances 3D visualization effects. |
| [12] | 2018 | Depth image | Traditional | Depth Image Based Rendering (DIBR) uses a gray-level depth map to shift pixels in a 2D image. However, this can leave "empty" areas that degrade image quality. Inpainting techniques are then used to fill these gaps. The paper surveys different image inpainting methods to address dis-occlusions in DIBR. |
| [13] | 2020 | RGB images | CNN, GAN | The paper reviews existing approaches categorized into sequential-based, CNN-based, and GAN-based methods. Each category includes methods tailored for different types of image distortion. The paper also presents available datasets and evaluates the performance of these methods on various distortions. |
| [14] | 2020 | RGB images | GAN | The paper discusses recent advancements in generative adversarial networks (GANs) within image processing, editing, and inpainting. It analyzes methods used in these applications, discusses challenges faced by GANs, and suggests future research directions. |
| [15] | 2021 | RGB images | Traditional and deep learning | The paper provides an overview of image inpainting methods over the past decade, highlighting the shift from traditional restoration to digital techniques. It categorizes traditional methods into five sub-categories and reviews deep learning methods such as CNNs and GANs. |
| [16] | 2021 | Facial | Various | The paper explores automatic facial wrinkle detection and inpainting algorithms, emphasizing wrinkles as a key sign of aging. It surveys computer vision techniques and reviews datasets, detection, and inpainting algorithms. |
| [17] | 2022 | Forensic | Various | The article examines the advancements in image inpainting technology and their implications for image forgery. It discusses the development of image forensics to detect forged images, noting current limitations compared to inpainting technology. |
| [19] | 2022 | RGB images | CNN, GAN | The paper discusses traditional methods briefly and then delves into deep learning-based approaches, particularly those using GAN models. It covers the problems addressed, advantages and disadvantages, and areas for improvement, and analyzes challenges in image inpainting. |
| [20] | 2022 | RGB images | CNN | The paper reviews image similarity measures used for inpainting and super-resolution. It categorizes existing methods, discusses their applications in inpainting with deep learning, and compares their pros and cons. |
| [21] | 2023 | Loss functions, datasets | Deep learning | The survey explores the advancements in deep learning-based image inpainting in terms of network structures and loss functions. It also summarizes available open-source codes, datasets, and metrics. Real-world applications are discussed, along with a performance analysis of different algorithms. |
| [22] | 2023 | RGB images | Fusion CNNs | The article reviews deep learning-based image inpainting, highlighting its significance in computer vision. It examines progress over the past 15 years, neural network structures, fusion methods, and different inpainting tasks. |

With the introduction of transformers in computer vision, a growing number of studies have adopted these architectures and the improvements they bring, especially for image or video inpainting. Unlike previous surveys, our work focuses on transformer-based techniques for image or video inpainting. We summarize the existing image or video inpainting models from different aspects and list some of the effective approaches in terms of qualitative results on several image or video inpainting datasets. Furthermore, we describe and analyze the most commonly proposed architectures in terms of the technique used and the challenges addressed in image inpainting. Finally, we list the open issues and challenges for image inpainting, in addition to future directions. Through this survey, we expect to make reasonable inferences and predictions about the future development of image or video inpainting, and to provide feasible solutions and guidance for image processing problems in other domains.

3 Taxonomy of image or video inpainting

In this section, we review transformer-based image inpainting algorithms according to the following taxonomy. First, we discuss the different types of damage (masks) in the image or video. Then, we present the different types of transformer-based architectures for image or video inpainting in detail. The important models are described in chronological order.

Figure 1: Types of masks added to the edited images: (a) object, (b) block, (c) noise, (d) scribble, (e) text, (f) scratch.

3.1 Mask types

Image inpainting originally referred to restoring old images by removing scratches and enhancing damaged portions. Presently, it is also employed to remove unwanted objects by replacing them with estimated values within the target area. In addition, it is used to repair various distortions, represented as masks, such as text, blocks, noise, scratches, and lines. These masks indicate the areas in an image or video that need to be filled or reconstructed. Figure 1 illustrates the distortion or mask types used in different image inpainting methods. Several types of masks commonly used in image inpainting are described as follows:

  • Blocks: a simple mask where a rectangular or a square region in the image is selected for inpainting. It is easy to create and use; however, it may not always be the most accurate representation of the damaged area.

  • Object: in some cases, only specific objects or regions within an image need to be inpainted. Object masks are used to specify these areas for reconstruction while leaving other parts of the image untouched.

  • Noise: random variations in brightness or color within an image, typically introduced by factors such as low light conditions, sensor limitations, or compression.

  • Scribble: involves marking the areas to be inpainted with simple strokes or scribbles. These masks provide a rough guideline for the inpainting algorithm and are often used in interactive inpainting systems. They are irregularly shaped and can be drawn by hand or generated using algorithms, such as segmentation techniques.

  • Text: unwanted text overlaid on an image, such as watermarks, captions, or annotations. Inpainting text in an image is the operation of removing or replacing the text with the original content. Text masks are often used in document restoration or editing applications.

  • Scratches: thin, elongated marks or lines on the surface of the image, often caused by physical damage or degradation of the image medium. Generally, this type of mask is found in old pictures.
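As a rough illustration of how some of the mask types listed above can be synthesized for training or evaluation, the following sketch generates block, noise, and scribble masks using only the Python standard library. The function names, the random-walk stroke model, and the convention that 1 marks a pixel to inpaint are assumptions made for this example; they are not taken from any of the surveyed methods.

```python
import random


def block_mask(h, w, top, left, bh, bw):
    """Rectangular hole: 1 marks pixels to inpaint, 0 marks valid pixels."""
    return [[1 if top <= r < top + bh and left <= c < left + bw else 0
             for c in range(w)] for r in range(h)]


def noise_mask(h, w, ratio, seed=0):
    """Drop a fixed fraction of pixels chosen uniformly at random."""
    rng = random.Random(seed)
    coords = [(r, c) for r in range(h) for c in range(w)]
    holes = set(rng.sample(coords, int(ratio * h * w)))
    return [[1 if (r, c) in holes else 0 for c in range(w)] for r in range(h)]


def scribble_mask(h, w, steps, seed=0):
    """Irregular stroke drawn by a random walk from the image centre."""
    rng = random.Random(seed)
    m = [[0] * w for _ in range(h)]
    r, c = h // 2, w // 2
    for _ in range(steps):
        m[r][c] = 1
        dr, dc = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        r = min(max(r + dr, 0), h - 1)  # clamp the walk to the image
        c = min(max(c + dc, 0), w - 1)
    return m


def apply_mask(img, mask, fill=0):
    """Erase masked pixels to produce the corrupted input image."""
    return [[fill if m else p for p, m in zip(irow, mrow)]
            for irow, mrow in zip(img, mask)]


block = block_mask(8, 8, top=2, left=3, bh=3, bw=4)
noise = noise_mask(8, 8, ratio=0.25)
scribble = scribble_mask(8, 8, steps=20)
corrupted = apply_mask([[255] * 8 for _ in range(8)], block)
```

Object and text masks usually come from a segmentation model or a rendered overlay rather than from a random generator, so they are left out of this sketch.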

Figure 2: Categories of image inpainting methods.

3.2 Transformer network representations

To differentiate various types of transformer-based network architectures, we divided the image inpainting models into three categories: blind image inpainting networks, mask-required networks, and GAN-based methods. These categories are illustrated in Figure 2, and a description of each method is detailed in Table 2.

3.2.1 Blind image inpainting

The blind image (single-stream) inpainting network is a neural network architecture designed to fill in missing or corrupted regions within an image using only the corrupted image as the input. Through a series of convolutional and/or transformer layers and dedicated inpainting modules, the network predicts the missing regions based on the available information from the input image. Despite its simplicity compared to mask-required (multi-stream) networks, single-stream architectures can still produce impressive inpainting results by learning to infer missing details solely from the provided input image.
•  CTN [23] introduces the contextual transformer network (CTN) for image inpainting. CTN tackles the challenge of modeling relationships between corrupted and uncorrupted regions while considering their internal structures. Unlike traditional methods, CTN utilizes transformer blocks to capture long-range dependencies and a multi-scale attention module to model intricate connections within different image regions.
•  ICT [25] combines transformers and CNNs for image completion. Transformers capture global structures and produce diverse outputs, while CNNs refine textures. This method achieves high-fidelity, diverse results.
•  MAT [26] introduces a transformer-based model named the mask-aware transformer (MAT) to efficiently inpaint large holes in high-resolution images. It combines the strengths of transformers for long-range interactions and convolutions for efficient processing. The model incorporates a customized transformer block with a dynamic mask to focus on relevant information and achieve high-fidelity, diverse image reconstructions.
•  BAT-Fill [28] is a novel image inpainting method that addresses the limitations of existing CNN-based approaches. While CNNs struggle with capturing long-range features, BAT-Fill leverages a "bidirectional autoregressive transformer" (BAT) for diverse and realistic content generation. Unlike traditional autoregressive transformers, BAT incorporates masked language modeling to analyze context from any direction, enabling better handling of irregularly shaped missing regions.
•  T-former [30] is an image inpainting method that aims to overcome the limitations of CNNs. While CNNs struggle with complex and diverse image damage, T-former leverages a novel attention mechanism inspired by transformers, which offers long-range modeling capabilities while remaining computationally efficient compared to traditional transformers.
•  PUT [32] is a novel transformer-based method for image inpainting that addresses the information loss issues in existing approaches. Existing methods down-sample images and quantize pixel values, losing information. PUT uses a patch-based autoencoder and an un-quantized transformer that processes the image in patches without down-sampling. The un-quantized transformer directly uses features from the autoencoder, avoiding information loss from quantization.
•  TransCNN-HAE [34] is used for blind image inpainting, which addresses the challenges of unknown and varied image damage. Unlike existing two-stage approaches, TransCNN-HAE operates in a single stage that combines transformers and CNNs; it leverages transformers for global context modelling to repair damaged regions and CNNs for local context modelling to reconstruct the repaired image. The cross-layer dissimilarity prompt (CDP) accelerates the identification and inpainting of damaged areas.
•  InstaFormer [35] is a novel network architecture for image-to-image translation that effectively combines global and instance-level information. (1) Global context: it uses transformers to analyze the overall content of an image, capturing relationships between different parts. (2) Instance-awareness: it incorporates bounding box information to understand individual objects within the image and their interactions with the background. (3) Style control: it enables the application of different artistic styles to the translated image using adaptive instance normalization. (4) Improved instance translation: it introduces a specific loss function to enhance the quality and faithfulness of translated object regions.
•  Campana et al. [37] observe that, although image inpainting has advanced with the integration of transformers in computer vision, their high computational cost poses challenges, especially with large damaged regions. To overcome this, they propose a variable-hyperparameter visual transformer architecture, which shows superior performance in reconstructing semantic content, such as human faces.
•  U2AFN [40] is an uncertainty-aware adaptive feedback network (U2AFN) used to enhance image inpainting for large holes. Unlike conventional methods, U2AFN predicts uncertainty alongside inpainting results and employs an adaptive feedback mechanism. This mechanism progressively refines inpainting regions by utilizing low-uncertainty pixels from previous iterations to guide subsequent learning.
•  CBNet [44] starts from the observation that existing methods are effective at completing small-sized or specifically masked corruptions but struggle with large-proportion corrupted images due to limited consideration of semantic relevance. To address this, CBNet uses an adjacent transfer attention (ATA) module in the decoder to preserve contour structure and blend structure–texture information. Additionally, a multi-scale contextual blend (MCB) block assembles multi-stage feature information. Extra deep supervision through a cascaded loss ensures high-quality feature representation.
•  CoordFill [47] is a novel method using continuous implicit representation to address limitations in image restoration. Using an attentional fast Fourier convolution (FFC)-based parameter generation network, the degraded image is down-sampled and encoded to derive spatial–adaptive parameters. These parameters are then used in a series of multi-layer perceptrons (MLPs) to synthesize color values from encoded continuous coordinates. This approach captures larger receptive fields by encoding high-resolution images at lower resolutions, while continuous position encoding enhances the synthesis of high-frequency textures. Additionally, the framework enables efficient parallel querying of missing pixel coordinates.
•  CMT [52] is a continuous mask-aware transformer for image inpainting. CMT utilizes a continuous mask to represent error amounts in tokens. It employs masked self-attention with overlapping tokens and updates the mask to model error propagation. Through multiple masked self-attention and mask update layers, CMT predicts initial inpainting results, which are further refined for improved image reconstruction.
•  TransInpaint [53] is a model for image inpainting that generates realistic content for missing regions while ensuring consistency with the overall context of the image. It utilizes a context-adaptive transformer and a texture enhancement network to produce superior results compared to existing methods.
•  NDMA [55] is a lightweight architecture for image inpainting that leverages nested deformable attention-based transformer layers. These layers efficiently extract contextual information, particularly for facial image inpainting tasks. Comparative evaluations on the CelebA-HQ and Places2 datasets demonstrate the superiority of the proposed approach.
•  Blind-Omni-Wav-Net [56] restores corrupted regions without additional mask information. This is challenging because it is difficult to distinguish between corrupted and valid areas, and existing approaches that predict the corrupted regions first often struggle to produce plausible results. To address this, the authors propose an end-to-end architecture combining a wavelet query multi-head attention transformer block with omni-dimensional gated attention. The wavelet query multi-head attention provides encoder features using processed wavelet coefficients, while the omni-dimensional gated attention facilitates effective feature transmission from the encoder to the decoder. Numerical and visual comparisons with state-of-the-art methods on standard datasets demonstrate the superiority of the approach for blind image inpainting.
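Several of the methods above rely on self-attention that is aware of which tokens are valid. The following minimal sketch illustrates only the general idea of restricting attention to valid tokens; it does not reproduce the MAT or CMT layers. It assumes a single attention head, no learned projections (queries, keys, and values are the raw token features), and a binary validity flag per token, whereas CMT, for instance, uses a continuous mask. The function name `masked_self_attention` is hypothetical.

```python
import math


def masked_self_attention(tokens, valid):
    """Single-head self-attention in which only valid tokens act as keys.

    tokens: list of feature vectors (lists of floats), one per patch.
    valid:  list of 0/1 flags; 0 marks a token inside the hole.
    Queries, keys and values are the raw token features (no learned
    projections), which keeps the sketch short.
    """
    if not any(valid):
        raise ValueError("at least one valid token is required")
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores; hole tokens get -inf so that the
        # softmax assigns them zero weight.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if v else -math.inf
            for k, v in zip(tokens, valid)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([
            sum(w * tok[j] for w, tok in zip(weights, tokens))
            for j in range(d)
        ])
    return out


# Four patch tokens; the third lies inside the hole and carries no signal.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
valid = [1, 1, 0, 1]
filled = masked_self_attention(tokens, valid)
```

Because every query attends to all valid tokens regardless of distance, the hole token is filled from the whole image in one step, which is the long-range behaviour that motivates the use of transformers for inpainting.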

Table 2: Summary of transformer-based image inpainting methods. Datasets considered: PSV, Places2, CelebA, FFHQ, and ImageNet.
| Network type | Method | Architecture | Technique | Mask type | Input type |
|---|---|---|---|---|---|
| Blind image inpainting | CTN [23] | Enc-TR-Dec | Stacked-Transformer | Block, Scribble | Image |
| | ICT [25] | TR-Enc-Dec | Bi-directional transformer | Object, Scribble | Patch |
| | MAT [26] | Conv-TR-Conv | Mask-aware transformer | Block, Scribble | Image |
| | BAT-Fill [28] | TR-Dec | Diverse structure, texture generation networks | Scribble | Image |
| | T-former [30] | TR | Transformer Unet | Scribble | Image |
| | PUT [32] | Enc-TR-Dec | Un-Quantized Transformer (UQ-Transformer) | Scribble | Patch |
| | TransCNN-HAE [34] | TR-Dec | Transformer-CNN Hybrid AutoEncoder | Scribble | Patch |
| | InstaFormer [35] | Enc-TR-Dec | ViT block | Object | Image |
| | Campana et al. [37] | Enc-TR-Dec | Variable-hyperparameter visual transformer | Scribble | Image |
| | U2AFN [40] | Enc-TR-Dec | Uncertainty-aware adaptive feedback network | Block, Scribble | Image |
| | CBNet [44] | Enc-TR-Dec | Cascading blend network | Scribble | Image |
| | CoordFill [47] | Enc-TR | Attentional Fast Fourier Conv (FFC), multi-layer perceptrons (MLP) | Object, Block | Image |
| | CMT [52] | TR | Masked self-attention and mask update (MSAU) layers | Scribble, Scratch | Image |
| | TransInpaint [53] | TR-Enc-Dec | DETR, context-adaptive transformer | Block, Scribble | Image |
| | NDMA [55] | Enc-Dec-Enc-Att | Deformable attention-based transformer | Scribble | Face image |
| | Blind-Omni-Wav-Net [56] | TR | Wavelet query multi-head attention (WQMA), omni-dimensional gated attention (OGA) | Scribble, Scratch | Image |
| Mask-required | ZITS [27] | Enc-TR-Dec | Consecutive encoder–decoder networks with transformer | Scribble | Four images |
| | APT [29] | TR | Atrous Pyramid Transformer | Scribble | Two images |
| | SPN [38] | Enc-TR-Dec | Semantic Pyramid Network | Scribble | Image |
| | SWMH [42] | Enc-TR-Dec | SWMH transformer blocks | Object, Scribble | Mask, Image |
| | ZITS++ [43] | Enc-TR-Dec | Transformer Structure Restorer (TSR) module for holistic structural priors at low resolution | Scribble | Four images |
| | TransRef [45] | TR | Reference-patch alignment (Ref-PA) module, reference-patch transformer (Ref-PT) module | Object, Scribble | Image |
| GAN-based | ACCP-GAN [24] | Unet-Enc-TR-Dec | Consecutive networks with TR | Scribble | Image |
| | AOT-GAN [31] | Enc-TR-Dec | Aggregated Contextual-Transformation | Scribble | Image |
| | HiMFR [33] | GAN-TR | Vision Transformer | Object | Image |
| | Wang et al. [34] | Enc-TR-Dec | Deep semantic structure modeling module (U-CITB) | Scribble | Image |
| | Li et al. [39] | Enc-TR-Dec | Prior-Driven Fused Contextual Transformation Network | Scribble | Image |
| | Swin-GAN [41] | Enc-Att-Dec | Transformer-based Discriminator | Scribble | Image |
| | SFI-Swin [46] | TR | Swin-Tr | Block, Scribble | Face image |
| | IIN-GCMAM [47] | Enc-(Dec+Atts) | CNN encoder + multi-level attention mechanism decoder (MAM-decoder) | Scribble | Image |
| | WAT-GAN [49] | Enc-TR-Dec | Window Aggregation Transformer (WAT), GAN | Scribble | Image |
| | UFFC [50] | - | Unbiased Fast Fourier Convolution (UFFC) | Block, Scribble | Image |
| | PATMAT [51] | GAN-TR | Mask-Aware Transformer (MAT) | Object | Face image |
| | GCAM [54] | Unet-TR | Lightweight Attention using Group Convolution module (LAGC) | Block, Scribble | Image |

3.2.2 Mask-required image inpainting

Mask-required networks for inpainting have a neural network architecture that utilizes multiple input streams to perform the inpainting task. The mask is fed into the network with the distorted image. This architecture is designed to handle various types of input information, which can improve the inpainting performance by leveraging different features and representations. The network takes in multiple streams of input data, each representing different types of information relevant to the inpainting task. For example, one stream could contain the corrupted image, another stream could contain additional contextual information, such as edge maps or semantic segmentation masks, and another stream could contain guidance from reference images.
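The difference between the blind and mask-required settings can be made concrete by looking at how the network input is assembled. The sketch below assumes that the streams are combined by simple channel concatenation, which is one common choice and not necessarily the mechanism used by any particular method reviewed here; the helper names `erase` and `build_input` and the `extra` parameter are hypothetical.

```python
def erase(channel, mask):
    """Zero out the pixels of one channel that fall inside the hole."""
    return [[0.0 if m else p for p, m in zip(prow, mrow)]
            for prow, mrow in zip(channel, mask)]


def build_input(image, mask=None, extra=()):
    """Assemble the network input as a list of H x W channels.

    image: list of colour channels (e.g. R, G, B).
    mask:  optional H x W binary map, 1 = pixel to inpaint.
    extra: optional auxiliary maps (e.g. an edge map or a segmentation map).
    Without a mask the corrupted image is passed through unchanged
    (blind setting); with a mask, the hole is erased and the mask and any
    auxiliary maps are appended as additional channels.
    """
    if mask is None:
        return [ch for ch in image]
    channels = [erase(ch, mask) for ch in image]
    channels.append([[float(m) for m in row] for row in mask])
    channels.extend(extra)
    return channels


h, w = 2, 3
rgb = [[[0.5] * w for _ in range(h)] for _ in range(3)]
mask = [[0, 1, 0], [0, 0, 0]]
edges = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

blind_in = build_input(rgb)                         # 3 channels
masked_in = build_input(rgb, mask, extra=[edges])   # 3 + 1 + 1 channels
```

In the blind case the network must infer where the damage is, whereas in the mask-required case the first layer sees the hole location explicitly through the extra mask channel.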
•  ZITS [27] tackles the challenge of restoring both textures and structures in corrupted images. While CNNs struggle with capturing holistic structures, attention-based models are computationally expensive for large images. To this end, a structure restorer network uses a transformer model in a low-resolution space to efficiently recover the overall structure of the image. The recovered structure is then integrated with existing inpainting models to add details and textures.
•  APT [29] is a two-stage image inpainting framework using a novel "atrous pyramid transformer" (APT). APT captures long-range dependencies to reconstruct damaged areas, while a "dual spectral transform convolution" (DSTC) module refines textures.
•  SPN [38] addresses the difficulty of restoring realistic content in images with missing regions. Existing image inpainting models often produce blurred textures or distorted structures in complex scenes due to contextual ambiguity. To address this, the authors propose the semantic pyramid network (SPN), which leverages multi-scale semantic priors learned from pretext tasks. SPN comprises two components: a prior learner that distills semantic priors into a multi-scale feature pyramid, ensuring a coherent understanding of global context and local structures, and a fully context-aware image generator that progressively refines visual representations with the prior pyramid. Optionally, variational inference enables probabilistic inpainting.
•  SWMH [42] is a model that combines a specialized transformer, named the stripe window multi-head (SWMH) transformer, with a traditional CNN. It has a novel loss function to enhance color details beyond RGB channels.
•  ZITS++ [43] is an improved version of the authors' previous work, ZITS. ZITS++ combines a specialized transformer with a traditional CNN. It introduces the transformer structure restorer (TSR) module for holistic structural priors at low resolution, upsampled by the simple structure upsampler (SSU). Texture details are restored using the Fourier CNN texture restoration (FTR) module, enhanced by Fourier and large-kernel attention convolutions. The upsampled structural priors from TSR are further processed by the structure feature encoder (SFE) and optimized incrementally with the zero-initialized residual addition (ZeroRA). Additionally, a new masking positional encoding addresses large, irregular masks.
•  TransRef [45] is a transformer-based encoder–decoder network for reference-guided image inpainting. The guidance process involves progressively aligning and fusing reference features with the features of the corrupted image. To precisely utilize reference features, the authors introduce the reference-patch alignment (Ref-PA) module, which aligns patch features from both reference and corrupted images while harmonizing style differences. Additionally, the reference-patch transformer (Ref-PT) module refines the embedded reference feature.
•  UFFC [50] examines the limitations of using the vanilla FFC module in image inpainting, including spectrum shifting and limited receptive fields. To address these issues, a novel unbiased fast Fourier convolution (UFFC) module is proposed, incorporating range transform, absolute position embedding, dynamic skip connection, and adaptive clipping. The experimental results demonstrate that the UFFC module outperforms existing methods in capturing texture and achieving faithful reconstruction in image inpainting tasks.

Figure 3: Transformer-based architectures used in different image inpainting methods. The transformer blocks are combined with different convolutional neural network (CNN)-based parts, including encoder–decoders, as in the Enc-TR-Dec, TR-Dec, and Enc-TR representations. Some architectures use a pure transformer-based network, such as TR and UNet-TR, which utilizes a transformer-based encoder and decoder.

3.2.3 GAN With Transformer image inpainting:

GAN-based image inpainting utilizes GANs to fill in missing or corrupted regions of an image. In this approach, two neural networks are trained simultaneously: a generator network and a discriminator network. The generator generates realistic content for the missing regions, while the discriminator tries to distinguish between the inpainted images and real images. Through adversarial training, the generator learns to produce convincing inpainted images that fool the discriminator. This method often produces visually appealing results by leveraging the adversarial loss to capture high-level image structures and textures. However, GAN-based inpainting is prone to issues such as mode collapse or blurriness, requiring careful optimization and architectural choices to address these challenges.
•  ACCP-GAN [24] is a method for automatically repairing defects in serial section images used in histology studies. ACCP-GAN combines two stages: one to detect and roughly fix damaged areas, and another to precisely refine the repairs. The model leverages transformers and convolutions to analyze neighboring images and healthy regions within the defective image, achieving high accuracy in both segmentation and restoration tasks.
•  AOT-GAN [31] tackles challenges in high-resolution image inpainting. It improves both context reasoning and texture synthesis, leading to more realistic reconstructions compared to existing approaches. This method is particularly effective for large, irregular missing regions.
•  HiMFR [33] is a system for recognizing masked faces. HiMFR first detects masked faces using a pre-trained model and inpaints the occluded regions using a GAN-based method. Finally, it recognizes the face, masked or reconstructed, using a hybrid recognition module. Experiments show competitive performance on benchmark datasets.
•  Wang et al. [34] address limitations in reconstructing large, damaged areas with image inpainting. The authors propose an enhanced gated convolution that extracts detailed features from the masked region using a gating mechanism. A U-Net-like deep structure modeling stage combines the long-range modeling of transformers with the texture learning of CNNs to capture global structures. Next, the reconstruction module merges shallow and deep features to generate the final inpainted image.
•  Li et al. [39] observe that recent image inpainting methods perform well on simple backgrounds but struggle with complex images due to the lack of semantic understanding and distant context. To address this, they propose a semantic prior-driven fused contextual transformation network. It utilizes a semantic prior generator to map features from ground-truth and damaged images, followed by a fusion strategy to enhance multi-scale texture features and an attention-aware module for structure restoration. Additionally, a mask-guided discriminator improves output quality. Results on various datasets show significant improvements over existing methods.
•  Swin-GAN [41] presents a transformer-based method for image inpainting, aiming to overcome limitations in capturing global and semantic information. The technique utilizes self-supervised attention and a hierarchical Swin transformer in the discriminator. Experimental results show superior performance compared to existing approaches, demonstrating the effectiveness of the proposed transformer-based approach.
•  SFI-Swin [46] targets the inpainting of face images, whose symmetric characteristics make the task more challenging than for natural scenes. Existing powerful models struggle to fill in missing parts while considering both symmetry and homogeneity. Additionally, standard metrics for assessing repaired face image quality fail to capture the preservation of symmetry between rebuilt and existing facial features. To address this, the authors propose a GAN-transformer-based solution: multiple discriminators that independently verify the realism of each facial organ, combined with a transformer-based network. They also introduce a novel metric called the "symmetry concentration score" to measure the symmetry of repaired face images.
•  IIN-GCMAM [47] is an image inpainting network using gated convolution and a multi-level attention mechanism to address deficiencies in existing methods. By weighing features with gated convolutions and employing multi-level attention, it enhances global structure consistency and repair result precision. Extensive experiments on datasets, such as Paris Street View (PSV) and CelebA, have validated its effectiveness.
•  WAT-GAN [49] is a novel transformer network with cross-window aggregated attention, used to address limitations of convolutional networks such as over-smoothing and limited long-range dependencies. Integrated into a generative adversarial network model, this approach embeds the window aggregation transformer (WAT) module to enhance information aggregation between windows without increasing computational complexity. Initially, the encoder extracts multi-scale features using convolution kernels of varying scales. These features are then input into the WAT module for inter-window aggregation, followed by reconstruction by the decoder. The resulting image is assessed by a global discriminator for authenticity. Experimental validation demonstrates that the transformer window attention network enhances the structured texture of restored images, particularly in scenarios involving large or complex structural restoration tasks.
•  PATMAT [51] is a method for face inpainting that enhances the preservation of facial details and identity. By fine-tuning a MAT with reference images, it outperforms existing models in quality and identity preservation.
•  GCAM [54] is a lightweight image inpainting method that emphasizes both restoration quality and efficiency on limited processing platforms. By combining group convolution and a rotating attention mechanism, the traditional convolution module is enhanced or replaced. Group convolution enables multi-level inpainting, while the rotating attention mechanism addresses information mobility issues between channels. A parallel discriminator structure ensures local and global consistency in the inpainting process. Experimental results show that the proposed method achieves high-quality inpainting while significantly reducing inference time and resource usage compared to other lightweight approaches.

4 Video inpainting

Video inpainting using transformers is a technique that uses visual transformer models to fill in missing or corrupted parts of a video sequence. By utilizing the transformer's ability to capture long-range dependencies in the data, this approach aims to seamlessly reconstruct the missing regions in video frames based on the surrounding context. Many methods have been proposed, and each of them is described in this section.

•  FuseFormer [57] is a transformer model tailored for video inpainting tasks to address issues with blurry edges. It utilizes Soft Split and Soft Composition operations to enhance fine-grained feature fusion. Soft Split divides feature maps into patches with overlap, while Soft Composition stitches patches together, allowing for more effective interaction between neighboring patches. These operations are integrated into tokenization and de-tokenization processes for better feature propagation. Additionally, FuseFormer enhances the capability of 1D linear layers to model 2D structures, improving sub-patch level feature fusion. The evaluation results demonstrated the superiority of FuseFormer over existing methods in both quantitative and qualitative assessments.
•  FAST [58] is a frequency-aware spatiotemporal transformer used for video inpainting detection. It utilizes global self-attention mechanisms to capture long-range relations and employs a spatiotemporal transformer framework to detect spatial and temporal connections. Additionally, FAST exploits frequency domain information using a specially designed decoder. Experimental results show competitive performance and good generalization.
•  DSTT [59] is a decoupled spatial–temporal transformer used for efficient video inpainting. It separates learning spatial–temporal attention into two tasks: one for temporal object movements and another for background textures. This allows precise inpainting. Additionally, a hierarchical encoder is used for robust feature learning.
•  E2FGVI [60] is an end-to-end framework for flow-guided video inpainting to improve the efficiency and effectiveness compared to existing methods. It replaces separate hand-crafted processes with three trainable modules: flow completion, feature propagation, and content hallucination. These modules correspond to previous stages but can be jointly optimized, leading to better results.
•  FGT [61] is a flow-guided transformer used for high-fidelity video inpainting, which utilizes motion discrepancy from optical flows to guide attention retrieval in transformers. A flow completion network is introduced to restore corrupted flows by leveraging relevant flow features within a local temporal window. With completed flows, the content is propagated across frames and flow-guided transformers are employed to fill in corrupted regions. Transformers are decoupled along temporal and spatial dimensions to integrate the completed flows for spatial attention. Additionally, a flow-reweight module controls the impact of completed flows on each spatial transformer. For efficiency, a window partition strategy is employed in both the spatial and temporal transformers.
•  DeViT [61], the deformed vision transformer, presents three key innovations: DePtH for patch alignment, MPPA for enhanced feature matching, and STA for accurate attention assignment. DeViT outperforms previous methods both qualitatively and quantitatively, setting a new state of the art for video inpainting.
•  DLFormer [63] is a discrete latent transformer. Unlike previous methods operating in continuous feature spaces, DLFormer utilizes a discrete latent space, leveraging a compact codebook and autoencoder to represent the target video. By inferring proper codes for unknown areas via self-attention, DLFormer produces fine-grained content with long-term spatial–temporal consistency. Additionally, it enforces short-term consistency to reduce temporal visual jitters.
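The Soft Split and Soft Composition operations described for FuseFormer above can be illustrated with a minimal sketch. It works on a 1-D feature sequence rather than 2-D feature maps, and it averages overlapping values when stitching so that the round trip recovers the input; both choices, as well as the function names `soft_split` and `soft_composition`, are assumptions made for illustration and are not taken from the FuseFormer implementation.

```python
def soft_split(seq, patch, stride):
    """Cut a 1-D feature sequence into overlapping patches (stride < patch)."""
    return [seq[s:s + patch] for s in range(0, len(seq) - patch + 1, stride)]


def soft_composition(patches, length, stride):
    """Stitch patches back together, averaging values where patches overlap."""
    total = [0.0] * length
    count = [0] * length
    for k, patch_values in enumerate(patches):
        start = k * stride
        for j, value in enumerate(patch_values):
            total[start + j] += value
            count[start + j] += 1
    # Positions that no patch covers are left at zero.
    return [t / c if c else 0.0 for t, c in zip(total, count)]


features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
patches = soft_split(features, patch=4, stride=2)   # two patches sharing two values
restored = soft_composition(patches, len(features), stride=2)
```

Because neighboring patches share positions, information written into one patch by a transformer block also reaches its neighbors when the patches are stitched back together.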

Figure 4: Loss functions used in each transformer-based inpainting method: (a) image inpainting, (b) video inpainting. Five loss functions are combined for transformer-based learning.

•  DMT [64] is a dual-modality-compatible inpainting framework used to address deficiencies in video inpainting. DMT_img, a pretrained image inpainting model, serves as a prior for distilling DMT_vid, enhancing performance in deficiency cases. The self-attention module selectively incorporates spatiotemporal tokens, accelerating inference and removing noise signals. Additionally, a receptive field contextualizer improves performance further.
•  FGT++ [65] is an enhanced version of the flow-guided transformer (FGT), resulting in more effective and efficient video inpainting. FGT++ addresses query degradation using a lightweight flow completion network and introduces flow guidance feature integration and flow-guided feature propagation modules. The transformer is decoupled along the temporal and spatial dimensions, utilizing flows for token selection and employing a dual-perspective multi-head self-attention (MHSA) mechanism. Experimental results show that FGT++ outperforms existing video inpainting networks in both quality and efficiency.
•  Liao et al. [66] propose an automatic video inpainting algorithm for clear street views in autonomous driving. Using depth/point cloud guidance, this method removes traffic agents from videos and fills missing regions. By creating a dense 3D map from point clouds, frames are geometrically correlated, allowing for straightforward pixel transformation. Multiple videos can be fused through 3D point cloud registration, addressing long-time occlusion challenges.
•  FITer [67] is a video inpainting method that enhances missing region representations using a feature pre-inpainting network (FPNet) before the transformer stage. This improves the accuracy of self-attention weights and dependency learning. FITer also employs an interleaving transformer with global and window-based local self-attention mechanisms for efficient aggregation of spatial–temporal features into missing regions.
•  ProPainter [68] is an enhanced framework for video inpainting that addresses the limitations in flow-based propagation and spatiotemporal transformers. ProPainter combines image and feature warping for more reliable global correspondence and employs a mask-guided sparse video transformer for increased efficiency.
•  SViT [69] is a new transformer-based video inpainting technique that leverages semantic information to enhance reconstruction quality. Using a mixture-of-experts scheme, multiple experts can be trained to handle mixed scenes with various semantics. By producing different local network parameters at the token level, this method achieves semantic-aware inpainting results.
•  FSTT [70] uses a flow-guided spatial temporal transformer (FSTT) for video inpainting, which effectively utilizes optical flow to establish correspondence between missing and valid regions in spatial and temporal dimensions. FSTT incorporates a flow-guided fusion feed-forward module to enhance features with optical flow guidance, reducing inaccuracies during MHSA. Additionally, a decomposed spatiotemporal MHSA module captures dependencies effectively. To improve efficiency, a global–local temporal MHSA module was designed.

5 Loss Functions:

A review of the cited transformer-based methods for image or video inpainting shows that various loss functions have been utilized to guide the generation of realistic results. To train these methods, authors generally combine more than one loss function, because each function serves a different objective. The most used loss functions in image inpainting include mean absolute error loss (L1) [71], adversarial loss, perceptual loss [75], reconstruction loss [72], style loss [74], and feature map loss. Other loss functions are used in only a small number of papers, such as the mask and SSIM loss functions used in ACCP-GAN [24], binary cross-entropy loss [76], cross-entropy loss [77], and diversified Markov random field loss [78].

For image inpainting and image translation tasks, including image generation and image segmentation, incorporating various loss functions can produce results that are better visually and more effective semantically. These loss functions can be categorized into three classes: contextual-based, style-based, and structure-based loss. Contextual-based loss functions focus on preserving the content or semantic information of the image, ensuring that the inpainted regions are coherent and homogeneous with the neighboring regions. Furthermore, they can be used to measure the similarity between the inpainted image and the ground truth in terms of both low-level details and high-level structures, preserving realistic content. This category includes the L1 and reconstruction loss functions [71, 72]. The style-based loss category is focused on capturing high-level semantic information rather than pixel-level details. It specifically targets the texture and artistic style of the original image, which is achieved by comparing the statistics of feature maps across different layers of the network. This category includes perceptual loss, style loss, and adversarial loss [74]. Structure-based loss, which can be regarded as a type of contextual loss, emphasizes maintaining the contextual coherence and structural integrity of the inpainted image, preserving the surrounding content. Figure 4 summarizes the loss functions used by the methods in this review. Each of these loss functions is described below.
Mean Absolute Error (L1) loss measures the absolute pixel-wise differences between the inpainted image and the ground truth. Minimizing it drives the generated image toward the original image in terms of pixel values.
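As a minimal sketch of the L1 loss just described, the following function averages the absolute pixel differences over a whole image. Images are represented as nested Python lists to keep the example dependency-free, and the name `l1_loss` is illustrative; practical implementations operate on tensors and may weight masked and unmasked pixels differently.

```python
def l1_loss(pred, target):
    """Mean absolute error between two images given as nested lists (H x W)."""
    diffs = [abs(p - t)
             for row_p, row_t in zip(pred, target)
             for p, t in zip(row_p, row_t)]
    return sum(diffs) / len(diffs)


ground_truth = [[0.0, 0.5], [1.0, 0.25]]
inpainted = [[0.0, 0.75], [0.5, 0.25]]   # two of the four pixels differ
loss = l1_loss(inpainted, ground_truth)
```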
Adversarial loss, introduced for GANs, arises from the competition between a generator and a discriminator. The generator aims to produce realistic images, while the discriminator learns to distinguish between real and generated images. The adversarial loss encourages the generator to produce visually realistic content.
Perceptual loss, also used in its high receptive field (HRF) form, focuses on capturing high-level semantic information. It measures the difference between the features of the original image and the features of the reconstructed image. By minimizing perceptual loss, the inpainted image is encouraged to capture the structural similarities with the original image.
Reconstruction loss compares the inpainted image with the original image after each transformation. It assesses the absolute differences between the generated images and real images.
Style loss captures the texture and artistic style of the original image. It is computed by comparing the statistics of feature maps across different layers. By minimizing style loss, the inpainted regions are encouraged to mimic the artistic style of the surrounding image content.
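The "statistics of feature maps" mentioned above are commonly taken to be Gram matrices of channel activations. The sketch below is written under that assumption: the normalization by the number of spatial positions and the use of a mean absolute difference between Gram matrices are illustrative choices, and the feature maps are toy values rather than activations of a pretrained network.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of a feature map given as C lists of N values."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def style_loss(feat_pred, feat_target):
    """Mean absolute difference between the Gram matrices of two feature maps."""
    g_pred, g_target = gram_matrix(feat_pred), gram_matrix(feat_target)
    c = len(g_pred)
    return sum(abs(g_pred[i][j] - g_target[i][j])
               for i in range(c) for j in range(c)) / (c * c)


feat_target = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
# Same activations shifted by one spatial position: the statistics are unchanged.
feat_shifted = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
# Different channel statistics: the first channel is always on, the second always off.
feat_other = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
```

Here `style_loss(feat_shifted, feat_target)` is zero even though the two maps differ position by position, which illustrates why style loss is usually paired with a pixel-level loss such as L1.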
Feature Map loss measures the similarity between feature maps extracted from the inpainted image and those from the ground-truth image. It encourages the inpainted regions to preserve important visual structures and textures present in the original image. Feature map loss is often used in conjunction with perceptual loss to guide the inpainting process effectively.
Hinge loss for video inpainting is used in adversarial settings, where it helps train discriminators to distinguish between real and inpainted video frames so that the inpainted frames become indistinguishable from the original content. It is not commonly used for the direct inpainting task but can improve the quality of inpainted videos in GAN-based approaches.
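A commonly used hinge formulation of the adversarial objective can be sketched as follows. The function names and the toy discriminator scores are assumptions for illustration, and individual methods in this review may use different variants or weightings.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Penalize real scores below +1 and fake (inpainted) scores above -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_hinge_loss(fake_scores):
    """The generator is rewarded when the discriminator scores its output highly."""
    return -sum(fake_scores) / len(fake_scores)


real_scores = [2.0, 0.5]    # discriminator outputs on real frames
fake_scores = [-2.0, 0.0]   # discriminator outputs on inpainted frames
d_loss = discriminator_hinge_loss(real_scores, fake_scores)
g_loss = generator_hinge_loss(fake_scores)
```

Scores that are already beyond the margin (2.0 for a real frame, -2.0 for an inpainted one) contribute nothing to the discriminator loss, so training concentrates on the frames that are still hard to tell apart.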
Cross-entropy loss in the context of video inpainting, is mainly suitable for classification tasks within the inpainting process, such as segmenting regions to be inpainted. It measures the difference between the predicted and actual distribution of classes (e.g., inpainted vs. not inpainted pixels).

6 Image and video inpainting datasets

To evaluate image and video inpainting methods, various datasets have been used. The Paris Street View dataset [79] was created for image inpainting, while the others originate from other tasks, such as Places2 [80] for scene recognition, CelebA-HQ [81] and FFHQ [82] for face recognition, and YouTube-VOS [83] for video object segmentation. In this section, the datasets most frequently used by the transformer-based inpainting methods are reviewed.

Paris Street View Dataset: The Paris Street View dataset [79] consists of 14,900 training images and 100 test images captured from street views in Paris. These images primarily focus on the city’s buildings, making the dataset valuable for tasks related to urban scenes and architectural elements.

CelebA-HQ Dataset: CelebA-HQ [81] is an extension of the CelebA dataset, providing high-quality images of celebrities with diverse attributes. It contains 30,000 high-resolution images (1024 × 1024 pixels) of celebrities in various poses, lighting conditions, and backgrounds. CelebA-HQ is commonly used for tasks such as facial recognition, attribute classification, and image generation, including image inpainting.

Places2 Dataset: Places2 [80] is a large-scale dataset focusing on scene understanding, containing images of various indoor and outdoor scenes from around the world. It includes over 10 million images covering 365 scene categories, ranging from natural landscapes to urban environments. Places2 is used for several tasks such as scene classification, semantic segmentation, and image inpainting.

FFHQ Dataset: The Flickr-Faces-HQ (FFHQ) [82] dataset is a high-quality image collection of human faces, comprising 70,000 images with a resolution of 1024 × 1024 pixels. The images vary in terms of age, ethnicity, and image background. In addition, the dataset covers a diverse range of accessories such as eyeglasses, sunglasses, and hats. FFHQ is used in different tasks, such as image generation, super-resolution, denoising, and inpainting.

YouTube-VOS Dataset: The YouTube-VOS (Video Object Segmentation) [83] dataset is designed for the task of semi-supervised video object segmentation, where the goal is to segment objects of interest in videos. It contains high-resolution video sequences with pixel-level annotations for foreground objects across multiple frames. The YouTube-VOS dataset is used for video object segmentation and video inpainting tasks.

DAVIS Dataset: The Densely Annotated Video Segmentation (DAVIS) dataset is a comprehensive resource designed specifically for the task of video object segmentation, offering high-quality annotations across consecutive frames for precise delineation of object boundaries. It contains 50 high-quality video sequences. Furthermore, DAVIS provides a benchmark for evaluating algorithms in the field of video segmentation, in addition to video inpainting.

7 Results and discussion

7.1 Evaluation Metrics

To evaluate the performance of image or video inpainting methods, a set of metrics is used to compare the generated image with the ground truth. In this section, we select the most used metrics for image and video inpainting, including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), learned perceptual image patch similarity (LPIPS), and Fréchet inception distance (FID). The metrics can be divided into two categories: pixel-based metrics, which evaluate the quality of the generated images, and patch-based metrics, which compute the perceptual similarity between two images. FID and LPIPS are patch-based metrics, and PSNR and SSIM are pixel-based metrics.

7.2 Pixel-based metrics:

Pixel-based metrics evaluate images at the level of individual pixels. PSNR and SSIM are the pixel-based metrics used to evaluate image inpainting methods.

Peak Signal-to-Noise Ratio (PSNR): Used to evaluate the quality of the generated image by comparing it with the ground truth. A higher PSNR indicates less noise and better quality. The PSNR is defined as follows:

$$PSNR=10\log_{10}\left(\frac{MAX_{I}^{2}}{MSE}\right) \tag{1}$$

where $MAX_{I}$ is the maximum pixel value of the image, and $MSE$ is the mean squared error between the ground truth and the generated image.
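Equation (1) can be evaluated directly, as in the following minimal sketch. It assumes 8-bit images (so the maximum pixel value defaults to 255) stored as nested lists; the function name `psnr` and the toy pixel values are illustrative.

```python
import math


def psnr(pred, target, max_value=255.0):
    """PSNR in decibels, following Eq. (1); returns infinity for identical images."""
    squared = [(p - t) ** 2
               for row_p, row_t in zip(pred, target)
               for p, t in zip(row_p, row_t)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [[100, 100], [100, 100]]
inpainted = [[110, 90], [100, 100]]      # MSE = (100 + 100) / 4 = 50
score = psnr(inpainted, ground_truth)
```

For this toy pair the MSE is 50, so the score is 10 log10(65025 / 50), roughly 31.1 dB.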

Structural Similarity Index (SSIM): This metric compares how similar two images are in terms of luminance, contrast, and structure, mimicking human perception. A higher SSIM indicates that the images are more alike. The SSIM is defined as follows:

$$SSIM(x,y)=\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})} \tag{2}$$

where $\mu_{x}$ and $\mu_{y}$ denote the mean luminance values of images $x$ and $y$, respectively, $\sigma_{x}$ and $\sigma_{y}$ denote the standard deviations of $x$ and $y$, respectively, $\sigma_{xy}$ is the covariance between $x$ and $y$, and $C_{1}$ and $C_{2}$ are constants.
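The following sketch evaluates Eq. (2) once over all pixels of two flattened images. This is a simplification: SSIM is usually computed over local windows and then averaged. The constants follow the conventional choice $C_{1}=(0.01L)^{2}$ and $C_{2}=(0.03L)^{2}$, with $L$ the dynamic range; this choice and the function name `ssim_global` are assumptions, since the text above only states that $C_{1}$ and $C_{2}$ are constants.

```python
def ssim_global(x, y, max_value=255.0):
    """Eq. (2) computed from global statistics of two flattened images."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n          # sigma_x squared
    var_y = sum((b - mu_y) ** 2 for b in y) / n          # sigma_y squared
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2   # assumed conventional constant
    c2 = (0.03 * max_value) ** 2   # assumed conventional constant
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [10.0, 20.0, 30.0, 40.0]
reversed_image = [40.0, 30.0, 20.0, 10.0]   # same mean and variance, opposite structure
same = ssim_global(reference, reference)
different = ssim_global(reference, reversed_image)
```

The reversed image has the same mean and variance as the reference, so only the covariance term in Eq. (2) lowers its score.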

7.3 Patch-based metrics:

Patch-based metrics evaluate images by comparing patches or local regions instead of individual pixels. These metrics typically use deep learning techniques to extract features from the patches. LPIPS and FID are patch-based metrics used to evaluate image inpainting methods.

Learned Perceptual Image Patch Similarity (LPIPS): This metric is used to evaluate the perceptual similarity between two images. Unlike traditional metrics, such as the MSE or PSNR, which measure pixel-wise differences, LPIPS is designed to capture perceptual differences that align more closely with human perception. LPIPS calculates the average Euclidean distance between the feature representations of corresponding patches or layers extracted from the images. This distance reflects the perceptual difference between the two images. The LPIPS metric can be expressed as follows:

$$\text{LPIPS}(I_{1},I_{2})=\frac{1}{N}\sum_{i=1}^{N}\|\phi(I_{1})_{i}-\phi(I_{2})_{i}\|_{2} \tag{3}$$

where $I_{1}$ and $I_{2}$ represent the input image and the ground truth, $\phi(I)$ denotes the feature representation of an image $I$, $\phi(I)_{i}$ is the feature vector of the $i$-th patch or layer in that representation, $N$ is the total number of patches or layers in the feature representation, and $\|\cdot\|_{2}$ denotes the Euclidean distance.
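The averaging in Eq. (3) can be sketched as follows, taking the feature vectors as given. In practice LPIPS extracts and weights these features with a pretrained network; the toy vectors below and the function name `mean_feature_distance` are purely illustrative and do not reproduce that network.

```python
import math


def mean_feature_distance(feats_1, feats_2):
    """Eq. (3): average Euclidean distance between corresponding feature vectors."""
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
                 for f1, f2 in zip(feats_1, feats_2)]
    return sum(distances) / len(distances)


# Two patches (or layers), each described by a 2-D feature vector.
feats_inpainted = [[0.0, 0.0], [1.0, 1.0]]
feats_truth = [[3.0, 4.0], [1.0, 1.0]]
distance = mean_feature_distance(feats_inpainted, feats_truth)
```

The first pair of vectors is 5 apart and the second pair is identical, so the average distance is 2.5.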

Fréchet Inception Distance (FID): This metric is used to evaluate the quality of images generated by generative models, such as GANs. It is considered a patch-based metric in some papers [46], while in others it is considered a feature-based metric. A lower FID value signifies a higher consistency between the two image sets. For video inpainting, researchers use the video variant, VFID. The FID is formulated as follows:

$$FID=\lVert\mu_{\text{real}}-\mu_{\text{gen}}\rVert^{2}+\text{Tr}\left(\Sigma_{\text{real}}+\Sigma_{\text{gen}}-2(\Sigma_{\text{real}}\Sigma_{\text{gen}})^{0.5}\right) \tag{4}$$

where $\lVert\cdot\rVert$ denotes the Euclidean norm, $\mu_{\text{real}}$ and $\mu_{\text{gen}}$ are the means of the feature representations of real and generated images, respectively, $\Sigma_{\text{real}}$ and $\Sigma_{\text{gen}}$ are the covariance matrices of the feature representations of real and generated images, respectively, and $\text{Tr}(\cdot)$ denotes the trace of a matrix.
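To make the two terms of Eq. (4) concrete, the sketch below evaluates it under a simplifying assumption: the covariance matrices are treated as diagonal, so the matrix square root reduces to element-wise square roots of the per-dimension variances. A full FID computation uses Inception features and a general matrix square root; the feature values and the function name `fid_diagonal` here are illustrative only.

```python
import math


def feature_stats(feats):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(feats)
    dims = range(len(feats[0]))
    mean = [sum(f[d] for f in feats) / n for d in dims]
    var = [sum((f[d] - mean[d]) ** 2 for f in feats) / n for d in dims]
    return mean, var


def fid_diagonal(real_feats, gen_feats):
    """Eq. (4) with both covariance matrices assumed to be diagonal."""
    mu_r, var_r = feature_stats(real_feats)
    mu_g, var_g = feature_stats(gen_feats)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Tr(S_r + S_g - 2 (S_r S_g)^0.5) for diagonal S_r and S_g.
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term


real_feats = [[0.0, 0.0], [2.0, 4.0]]   # mean [1, 2], variance [1, 4]
gen_feats = [[1.0, 1.0], [3.0, 1.0]]    # mean [2, 1], variance [1, 0]
fid = fid_diagonal(real_feats, gen_feats)
```

In this toy case the mean term contributes 2 and the trace term contributes 4: the generated features are both shifted and less spread out in the second dimension than the real ones.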

Table 3: The performance of each method on the PSV and Places2 image inpainting datasets using a 40–50% mask ratio. The bold and underline fonts respectively represent the first and second place.
Method | Image size | Paris Street View: PSNR↑ SSIM↑ LPIPS↓ FID↓ | Places2: PSNR↑ SSIM↑ LPIPS↓ FID↓ | Params (M)
CTN [23] 256 × 256 24.91 0.812 - - 21.52 0.78 - - 21
ICT [25] 256 × 256 - - 22.635 0.739 - 34.206 122
MAT [26] 512 × 512 - - - - 24.169 0.900 - 22.90 60
ZITS [27] 256 × 256 - - - 24.42 0.870 0.133 26.08 68
ZITS++ [43] 256 × 256 - - - 24.50 0.885 0.118 25.64 83
BAT-Fill [28] 256 × 256 21.76 0.865 0.14 63.81 21.74 0.70 - 32.55 -
APT [29] 256 × 256 24.44 0.858 - - 21.72 0.820 - - -
T-former [30] 256 × 256 25.37 0.825 - 46.60 21.52 0.770 - 34.52 14
PUT [32] 256 × 256 - - 22.94 0.75 - 31.48 -
TransCNN-HAE [34] 256 × 256 24.482 0.841 - 68.791 24.471 0.843 - 28.226 19
Campana et al. [37] 256 × 256 25.821 0.820 0.112 47.26 22.317 0.775 0.140 4.640 17
SPN [38] 256 × 256 24.64 0.795 - 70.76 22.18 0.763 - 4.73 50
U2AFN [40] 256 × 256 - - - - 20.36 0.615 - - -
SWMH [42] 256 × 256 24.127 0.804 0.121 - - - - - 6
CBNet [44] 256 × 256 24.72 0.796 - 11.90 21.48 0.753 - 10.74 21
CoordFill [47] 512 × 512 - - - - 26.365 0.912 0.068 -
CMT [52] 256 × 256 21.70 0.8408 - 19.36 - - - -
NDMAL [55] 256 × 256 - - - - 21.89 0.776 0.312 37.88 4
TransInpaint [53] 256 × 256 - 0.884 0.104 8.05 - - - - -
Blind-Omni-Wav-Net [56] 256 × 256 27.81 0.905 - 40.646 27.55 0.918 - 17.521 16
GAN
Wang et al. [34] 256 × 256 25.827 0.862 - 114.51 22.850 0.838 - 9.20 -
Swin-GAN [41] - 22.301 0.777 - - - - - - -
UFFC [50] 256 × 256 - - - - 26.41 0.81 20.24 - -

7.4 Results discussion

In this section, we compare the reviewed methods in terms of the results they report, using the evaluation metrics on various image and video inpainting datasets. For image inpainting, the most commonly used datasets are PSV, Places2, CelebA-HQ, and FFHQ. For video inpainting, the methods most commonly use the YouTube-VOS and DAVIS datasets. We also compare the number of parameters of each model, which allows researchers to identify lightweight models with respect to computational resource constraints, especially for inpainting high-resolution images and videos.

Evaluation on PSV and Places2 datasets

Table 3 shows the results obtained by transformer-based image inpainting methods on the PSV and Places2 datasets in terms of the PSNR, SSIM, LPIPS, and FID metrics. In this comparison, we present the results for the 40–50% mask ratio, which is the ratio reported by the majority of the papers. The table also lists the input image size used in the experiments.
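
As a reminder of what the PSNR columns measure, the following minimal sketch computes PSNR between a ground-truth image and an inpainted result. It assumes 8-bit images stored as NumPy arrays; the function name `psnr` and the peak value of 255 are illustrative choices and are not taken from any of the reviewed implementations, whose evaluation protocols may differ (for example in color space or value range).

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two images of equal shape."""
    reference = reference.astype(np.float64)
    estimate = estimate.astype(np.float64)
    mse = np.mean((reference - estimate) ** 2)  # mean squared error over all pixels
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

# Toy example (hypothetical data): every pixel is off by 5 gray levels.
gt = np.full((256, 256, 3), 128, dtype=np.uint8)
out = np.full((256, 256, 3), 133, dtype=np.uint8)
value = psnr(gt, out)
print(round(value, 2))  # 34.15
```

Because PSNR depends only on the pixel-wise error, two methods with similar PSNR can still differ in perceptual quality, which is why the tables also report SSIM, LPIPS, and FID.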

For the PSV dataset, Blind-Omni-Wav-Net outperforms the other methods, achieving the highest PSNR value (27.81), which demonstrates its effectiveness in reconstructing high-fidelity images. The methods proposed by Campana et al. and Wang et al. obtained the second-best PSNR results, about two points below Blind-Omni-Wav-Net. The majority of the remaining methods exceed 24. In terms of SSIM, Blind-Omni-Wav-Net also had the highest result (0.905), ahead of TransInpaint (0.884) and BAT-Fill (0.865). This indicates that Blind-Omni-Wav-Net preserves the structural and textural integrity of the inpainted images. On the other hand, TransInpaint and Campana et al. achieved the lowest LPIPS values, reflecting superior texture and detail accuracy, although the LPIPS values are close for all methods that report them. For the FID metric, TransInpaint obtained the lowest score (8.05), indicating its effectiveness in generating images close to real images, which is consistent with its SSIM and LPIPS results.

The same comparison was performed on the Places2 dataset using the same evaluation metrics. Almost all methods used this dataset in their experiments. Blind-Omni-Wav-Net obtained the highest PSNR and SSIM values, which demonstrates its effectiveness relative to the other methods. In second place, CoordFill reached 26.365 for PSNR and 0.912 for SSIM. This can be attributed to its use of attentional FFC and to the fact that it processes only the missing regions, while the other regions are not processed and keep their original pixel values. For this reason, CoordFill can work on high-resolution images. For the other methods, including MAT, ZITS, ZITS++, and TransCNN-HAE, the PSNR values were close. CoordFill and ZITS++ obtained the lowest LPIPS scores, highlighting their proficiency in capturing and reproducing the complex textures and details of the different scenes in the Places2 dataset. In terms of the FID metric, Campana et al. and SPN reached the lowest values. Comparing the metric values across methods, we found that some methods perform well on one metric yet are not the best on others. This can be explained by the strength of each method on a specific aspect, such as preserving high-resolution quality, preserving semantic similarity, or generating realistic textures.
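
The idea of leaving known pixels untouched can be illustrated with the generic mask-compositing step found in many inpainting pipelines. This is only a sketch of the general principle and not CoordFill's actual implementation; the function name `composite` and the convention that the mask equals 1 inside the hole are assumptions made for this example.

```python
import numpy as np

def composite(original: np.ndarray, predicted: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Take hole pixels from `predicted` and keep all known pixels from `original`.

    original, predicted: (H, W, C) float arrays.
    mask: (H, W) array, 1 = missing pixel, 0 = known pixel (assumed convention).
    """
    m = mask[..., None].astype(original.dtype)  # broadcast the mask over channels
    return m * predicted + (1 - m) * original

# Toy example (hypothetical data): black image, white prediction, 2x2 hole.
original = np.zeros((4, 4, 3), dtype=np.float32)
predicted = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
result = composite(original, predicted, mask)
print(result[..., 0])
```

With this step, any error outside the hole is zero by construction, which tends to favor full-image pixel-wise metrics such as PSNR and SSIM.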

The results presented in Table 3 are for the 40–50% mask ratio; however, some methods were also evaluated with different mask ratios. These results were collected and compared in terms of PSNR for each mask ratio, as illustrated in Figure 5. We observed that the PSNR of these methods decreases as the mask ratio increases. On the PSV dataset, the method by Li et al. performed well at the 10–20% ratio, while APT was the best in terms of PSNR at the 50–60% ratio. On the Places2 dataset, the method by Li et al. was likewise the best at the 10–20% and 20–30% ratios, while SWMH was the best at the 40–50% and 50–60% ratios. Some methods, such as Campana et al., APT, and GCAM, reported experiments for only two or three of these ratios.
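
The 10%-wide mask-ratio bins used in this comparison can be derived from a binary mask as sketched below. The helper name `mask_ratio_bin` and the convention that non-zero mask values mark missing pixels are assumptions for illustration; individual papers may generate or bin their masks differently.

```python
import numpy as np

def mask_ratio_bin(mask: np.ndarray, width: int = 10) -> tuple[float, str]:
    """Return the fraction of missing pixels and its bin label, e.g. '40-50%'."""
    missing = int(np.count_nonzero(mask))  # non-zero = missing (assumed convention)
    total = int(mask.size)
    percent = 100 * missing // total       # floored integer percentage
    lower = min(percent // width * width, 100 - width)  # a full mask falls in 90-100%
    return missing / total, f"{lower}-{lower + width}%"

# Toy example (hypothetical mask): a 128x224 rectangular hole in a 256x256 image.
mask = np.zeros((256, 256), dtype=np.uint8)
mask[64:192, 16:240] = 1
ratio, label = mask_ratio_bin(mask)
print(f"{ratio:.4f} -> {label}")  # 0.4375 -> 40-50%
```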

Figure 5: Performance of image inpainting methods based on mask ratio. Panels: (a) PSV, (b) Places2, (c) CelebA.
Table 4: The performance of each method on the CelebA-HQ and FFHQ image inpainting datasets using a 40–50% mask ratio. The bold and underline fonts respectively represent the first and second place.
Method Image size CelebA-HQ FFHQ Params (M)
PSNR↑ SSIM↑ LPIPS↓ FID↓ PSNR↑ SSIM↑ LPIPS↓ FID↓
CTN [23] 256×256 25.43 0.909 - - - - - - 21
MAT [26] 512×512 25.167 0.917 - 4.86 - - 60
BAT-Fill [28] 256×256 22.53 0.843 - 11.07 - - - - -
T-former [30] 256×256 25.67 0.915 - 5.42 - - 14
PUT [32] 256×256 - - 24.12 0.888 - 19.93 -
TransCNN-HAE [35] 256×256 24.148 0.902 - 8.783 23.292 0.885 - 9.740 19
Campana et al. [37] 256×256 26.572 0.872 0.068 2.253 - - - - 17
SPN [38] 256×256 26.54 0.871 - 12.38 - - - - 50
U2AFN [40] 256×256 22.10 0.905 - - - - - - -
SWMH [42] 256×256 24.616 0.822 0.103 - - - - - 6
ZITS++ [43] 256×256 - - - 27.56 0.918 0.069 5.50 83
CBNet [44] 256×256 25.04 0.827 - 6.307 - - - - 21
CoordFill [47] 512×512 28.756 0.934 0.065 - - - - - -
CMT [52] 256×256 23.78 0.899 - 5.23 - - - - -
TransInpaint [53] 256×256 - 0.941 0.079 4.46 - - - - -
NDMAL [55] 256×256 23.14 0.858 0.1479 12.897 - - 4
Blind-Omni-Wav-Net [56] 256×256 28.21 0.951 - 7.235 28.19 0.952 - 8.639 16
GAN
Wang et al. [34] 256×256 26.35 0.926 - 11.01 25.388 0.914 - 6.315 -
Li et al. [39] 256×256 27.51 0.89 0.09 31.9 - - - - -
Swin-GAN [41] 256×256 26.012 0.870 - - - - - -
GCAM [54] 512×512 26.63 0.857 - 4.203 - - 7

Evaluation on CelebA-HQ and FFHQ datasets

In addition to PSV and Places2, two other datasets are widely used by transformer-based image inpainting methods: CelebA-HQ and FFHQ, both of which contain human faces. To compare the methods on these two datasets, we present the reported results for various metrics in Table 4. Most of the methods reached convincing results in terms of the quality of the generated images and the fidelity of the filled region, as represented by the SSIM metric. For example, on the CelebA-HQ dataset, we found that 15 of the 20 methods reached a PSNR value above 24, while all SSIM values were above 0.80. The two best PSNR results were obtained by CoordFill (28.756, at 512×512) and Blind-Omni-Wav-Net (28.21, at 256×256). For the SSIM metric, Blind-Omni-Wav-Net and TransInpaint produced the best results. For the LPIPS metric, the Campana et al. and CoordFill methods were the best. Each of these methods is designed to solve a specific challenge; thus, each is best on some metrics. For example, CoordFill uses a technique that preserves the pixels of the regions that are not missing, which benefits its PSNR and SSIM values.

Using the different mask ratios represented in Figure 5, we can see the differences between the methods in terms of PSNR. Some methods, such as Blind-Omni-Wav-Net and CoordFill, are not represented in this figure because they did not report results for different mask ratios. Among the methods that report various mask ratios, the method of Li et al. was the best at almost all ratios, except for 10–20%. The SWMH method was not the best for ratios below 50%, but its results were close to the best for the 50–60% mask ratio.

Table 5: The performance of each method on the YouTube-VOS and DAVIS video inpainting datasets. The bold and underline fonts respectively represent the first and second place.
Method YouTube-VOS DAVIS
PSNR↑ SSIM↑ VFID↓ $E_{warp}$↓ LPIPS↓ PSNR↑ SSIM↑ VFID↓ $E_{warp}$↓ LPIPS↓
FuseFormer [57] 33.16 0.9673 0.051 0.0875 - 32.54 0.9700 0.138 0.1336 -
DSTT [58] 32.66 0.9646 0.052 0.1430 31.75 0.9650 0.148 0.1716 -
E2FGVI [60] 33.71 0.9700 0.046 0.0864 33.01 0.9721 0.116 0.1315 -
FGT [61] 34.53 0.976 - - - 33.41 0.974 - - 0.023
DeViT [62] 33.42 0.9732 0.1429 0.049 - 32.43 0.9721 0.1663 0.133 -
DLFormer [63] 33.95 0.970 0.082 - - 34.22 0.977 0.062 - -
DMT [64] 34.27 0.973 0.044 - - 33.82 0.976 0.104 - -
FGT++ [65] 35.02 0.976 - - 0.025 33.18 0.971 - - 0.028
ProPainter [68] 34.43 0.9735 0.042 0.974 - 34.47 0.9776 0.098 1.187 -
SAVIT [69] 33.97 0.9727 0.043 0.0436 - 33.14 0.9748 0.107 0.0673 -
FSTT [70] 34.33 0.9731 0.044 - - 33.77 0.9756 0.109 - -

In the same context, the methods evaluated on the FFHQ dataset generally obtained lower values than on the other datasets. Blind-Omni-Wav-Net achieved the best PSNR and SSIM values, while ZITS++ was the best in terms of the LPIPS and FID metrics. The number of parameters can contribute to the robustness of a model, but it can also be a challenge in terms of computational resources. NDMAL and SWMH have the lowest numbers of parameters; however, their results were also lower than those of the other methods.

In conclusion, the results of these inpainting methods across the datasets indicate not only the improvements made in the field but also the impact of transformer-based techniques on the inpainting task. Furthermore, the diversity of the methods makes it possible to work on different aspects, such as image quality, structural similarity, or computational efficiency, for the purpose of generating realistic images.

Figure 6: Results obtained on three videos from the DAVIS dataset. First column: video frame with mask. Second column: DSTT. Third column: FuseFormer. Fourth column: E2FGVI. Fifth column: ProPainter.

Evaluation on YouTube-VOS and DAVIS datasets

To evaluate the video inpainting methods, we compared them on the most used datasets, YouTube-VOS and DAVIS. The reported results are presented in Table 5 using different metrics, including PSNR, SSIM, VFID, LPIPS, and $E_{warp}$. On both datasets, all methods report PSNR and SSIM and most report VFID, while the other two metrics are reported by only some of the methods.

For the YouTube-VOS dataset, almost all reported PSNR values exceed 33; FGT++ was the best with a value of 35.02, followed by FGT, ProPainter, FSTT, and DMT, which all exceeded 34. The same methods reached close results in terms of the SSIM and VFID metrics. The best SSIM value was 0.976, which reflects the high quality now reached in video inpainting. In addition, the metric values reported for videos are higher than those reported for image inpainting.

On the DAVIS dataset, ProPainter and DLFormer reached the best PSNR values, with a difference of about 0.2 between them. The other methods also achieved close PSNR values, most of them within 1 point of each other. The same observation holds for the SSIM metric. These results demonstrate the capability of these methods to inpaint videos with convincing performance. This remains true when we compare the results across the YouTube-VOS and DAVIS datasets: the results are similar even though the scenes of the two datasets are different.

To illustrate the results reported with these metrics, we ran the methods with publicly available source code, namely DSTT, FuseFormer, E2FGVI, and ProPainter, on three videos from the DAVIS dataset. The results are shown in Figure 6. The masks are shown in the first column, and the remaining columns show the results of each method. For the first video, none of the methods fully succeeds in inpainting the bus, although ProPainter inpaints it better than the others. For the second video, the results are good. For the third video, the methods succeed in inpainting most parts of the object; however, the shadow of the player remains visible with E2FGVI and DSTT. Overall, the transformer-based methods improve the quality of video inpainting.

8 Image inpainting challenges

Deep learning is a widely adopted technology across computer science and robotics tasks that assist human actions. Using artificial neural networks, which are loosely modeled on the human brain, deep learning is a branch of artificial intelligence (AI) that addresses classification and recognition goals by learning from specific data for specific scenarios [85]. For image inpainting, the process of filling in missing or damaged areas of an image poses several challenges. These can be divided into general computer vision and architecture challenges and challenges specific to image inpainting, such as the quality of the images. Furthermore, some challenges are specific to transformer-based methods. We discuss these challenges in detail below.

Preservation of Semantics: Inpainting algorithms must preserve the semantic content of the image while filling in missing regions. The filled-in areas should blend seamlessly with the surrounding context and maintain the overall meaning of the image. Transformer-based models may struggle with preserving spatial coherence in inpainted regions, especially when dealing with complex textures or intricate structures. Ensuring smooth transitions and consistent patterns across the inpainted areas remains a challenge. In addition, inpainting requires synthesizing textures and structures to replace missing regions. Generating realistic textures that match the surrounding areas and maintaining structural coherence is essential for producing convincing inpainted results.

Context Understanding: Transformer models are effective at capturing long-range dependencies in sequential data, but understanding contextual information in images can be challenging [21]. For image or video inpainting, understanding the global context of the image, including scene semantics and object relationships, is crucial for generating realistic and coherent inpainted results. In addition, inpainting algorithms need to accurately reconstruct missing edges and ensure smooth transitions between filled-in and original regions. This can be a challenge for deep learning models, including transformer-based models. Furthermore, handling missing regions of different scales, from small scratches to large objects, is challenging, and removing noise or artifacts can further complicate the inpainting process.

Complexity of Architecture: In the literature, the effective architectures used for feature extraction are generally complex, making them challenging to train, interpret, and optimize [87, 88]. Balancing model complexity with performance requirements is crucial, but it can be difficult to achieve because of other factors, such as the available computational resources, especially for large-scale datasets, and the number of parameters and computational cost (GFLOPs) of each model. In addition, training a complex architecture can be more time-consuming on some tasks than on others. For transformer-based models, the self-attention mechanism adds to this complexity, because its cost grows quadratically with the number of tokens. Thus, efficient implementation strategies and optimization techniques are required to make transformer-based inpainting methods practical for real-world use cases.
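
To make the cost of self-attention concrete, the following back-of-the-envelope sketch counts the multiply-accumulate operations (MACs) needed for the attention score matrix and the weighted sum of values in a single global self-attention layer. The patch size of 16 and the embedding dimension of 256 are hypothetical values chosen for illustration; the estimate ignores the linear projections, multi-head bookkeeping, and any windowing or sparsity that specific methods use to reduce this cost.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def attention_macs(height: int, width: int, patch: int, dim: int) -> int:
    """Rough MAC count for Q @ K^T plus attn @ V in one global attention layer."""
    n = num_tokens(height, width, patch)
    return 2 * n * n * dim  # two N x N x dim matrix products

for size in (256, 512):
    n = num_tokens(size, size, patch=16)
    macs = attention_macs(size, size, patch=16, dim=256)
    print(f"{size}x{size}: {n} tokens, {macs:,} MACs")
# 256x256: 256 tokens, 33,554,432 MACs
# 512x512: 1024 tokens, 536,870,912 MACs
```

Doubling the image side multiplies the number of tokens by 4 and the attention cost by 16, which illustrates why high-resolution inpainting is demanding for global self-attention.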

Overfitting: Deeper feature extraction architectures are prone to overfitting, where the model memorizes the training data rather than learning generalizable features [89]. Tuning regularization settings, such as dropout and weight decay, can minimize the impact of this challenge; however, finding the right combination requires many experiments, and it can change from one task to another.

Data quality requirements: Training a CNN model requires large-scale annotated datasets, which can be expensive, time-consuming, or even unavailable for certain domains or applications. Data augmentation techniques can help to handle this challenge in some cases but may not cover all scenarios needed for representative training data. The quality of the data also represents a challenge for deep learning architectures; for example, high-resolution images generally yield good results, but training on them requires substantial computational resources, which is another challenge. For transformer models, training on high-resolution images requires dividing the image into smaller patches or applying hierarchical approaches, which can affect the quality.
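
Dividing an image into patches, as mentioned above, can be sketched with NumPy as follows. This is a generic ViT-style patch split shown under assumed settings (non-overlapping 16×16 patches, channels-last layout); the function name `patchify` is illustrative, and the reviewed methods may use overlapping patches, learned embeddings, or hierarchical schemes instead.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch * patch * C), in row-major
    order over the patch grid.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

# Toy example (hypothetical data): a 512x512 RGB image with 16x16 patches.
image = np.arange(512 * 512 * 3, dtype=np.float32).reshape(512, 512, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (1024, 768)
```

Each token only sees a 16×16 neighborhood before attention is applied, so fine details that cross patch borders must be recovered by the model, which is one way the patch split can affect output quality.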

Computational Resources: Deep-learning-based models require significant computational resources, including powerful GPUs or TPUs, for training and inference. Scaling CNNs to handle larger datasets or more complex architectures increases the demand for computational resources and limits the accessibility for researchers. The same holds for transformer-based models, which are computationally intensive: building real-time or interactive inpainting applications remains a challenge, and large-scale training datasets are required to learn meaningful representations.

Domain Adaptation: CNNs trained on specific datasets or for a specific task may not be suitable for different datasets or real-world environments due to domain shifts or biases. Adapting pre-trained CNNs to new domains or tasks with limited annotated data represents a challenge, especially when the target domain differs significantly from the source domain [90]. This is shown by the feature extraction models used for specific tasks in the previous section. For transformer-based models, generating diverse and high-quality training data for image inpainting tasks, particularly for specific image types, can be challenging. Furthermore, it is a challenge to ensure that the model generalizes well to unseen data and various inpainting scenarios.

9 Concluding Remarks and Future Directions

In this paper, we reviewed several research papers on image and video inpainting techniques based on visual transformers, including their ability to capture long-range dependencies and model complex relationships within images. The proposed methods attempt to enhance the task, including efficiency and information preservation, achieving both realistic textures and structures.

Image and video inpainting have advanced significantly with the rise of deep learning, notably CNNs and GANs, which excel at filling missing or damaged regions while preserving context. Recently, transformer-based architectures have emerged as promising alternatives, leveraging self-attention mechanisms to understand global context effectively. This paper undertakes a comprehensive review, focusing on transformer-based techniques for image and video inpainting. Through a systematic categorization based on architectural configurations, types of damages, and performance metrics, we aim to demonstrate the significant progress and offer guidance to aspiring researchers in the field.

In the domain of transformer-based image and video inpainting, a notable challenge lies in refining the model’s ability to effectively handle complex and dynamic visual contexts. This requires developing mechanisms that can seamlessly integrate temporal information in video sequences while preserving spatial coherence, thus ensuring the faithful reconstruction of missing regions. Additionally, addressing the computational cost associated with the large-scale transformer architectures demands innovative strategies for optimizing efficiency without compromising performance, thereby enabling real-time inpainting for practical applications. Furthermore, enhancing the model’s robustness to diverse and challenging inpainting scenarios, such as occlusions, irregular shapes, and varying textures, remains a critical frontier in advancing the capabilities of this transformative technology.

In terms of future research directions, several open questions remain in the realm of image and video inpainting, particularly concerning transformer-based techniques. Key avenues for further exploration include enhancing the handling of long-range dependencies to improve inpainting accuracy, investigating the performance of transformer-based approaches on diverse datasets beyond standard image and video formats to uncover new challenges and opportunities, and refining the realism and consistency of inpainted regions, especially in scenarios involving intricate textures or complex structures. Additionally, addressing temporal consistency across frames in video inpainting and ensuring robustness to various damage types, such as occlusions, corruptions, and missing data, are crucial areas for future research. Furthermore, optimizing the efficiency and scalability of transformer-based architectures for large-scale datasets or real-time applications remains an ongoing challenge. By addressing these open questions, the field can advance towards more versatile, robust, and efficient inpainting solutions.

Acknowledgments

The project is funded by the College of Information Technology (CIT), United Arab Emirates University (UAEU).

References

  • [1] Wang, M., Yan, B., & Ngan, K. N. (2013). An efficient framework for image or video inpainting. Signal Processing: Image Communication, 28(7), 753-762.
  • [2] Newson, A., Almansa, A., Gousseau, Y., & Pérez, P. (2017). Non-local patch-based image inpainting. Image Processing On Line, 7, 373-385.
  • [3] Abdulla, A. A., & Ahmed, M. W. (2021). An improved image quality algorithm for exemplar-based image inpainting. Multimedia tools and applications, 80(9), 13143-13156.
  • [4] Elharrouss, O., Akbari, Y., Almadeed, N., & Al-Maadeed, S. (2024). Backbones-review: Feature extractor networks for deep learning and deep reinforcement learning approaches in computer vision. Computer Science Review, 53, 100645.
  • [5] Elharrouss, O., ElKaitouni, S. E., Akbari, Y., Al-Maadeed, S., & Bouridane, A. (2023, September). Attention-based Network for Image/Video Salient Object Detection. In 2023 11th European Workshop on Visual Information Processing (EUVIP) (pp. 1-6). IEEE.
  • [6] Himeur, Y., Al-Maadeed, S., Almaadeed, N., Abualsaud, K., Mohamed, A., Khattab, T., & Elharrouss, O. (2022). Deep visual social distancing monitoring to combat COVID-19: A comprehensive survey. Sustainable cities and society, 85, 104064.
  • [7] ViDMASK dataset for face mask detection with social distance measurement
  • [8] Lin, T., Wang, Y., Liu, X., & Qiu, X. (2022). A survey of transformers. AI open, 3, 111-132.
  • [9] Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., … & Tao, D. (2022). A survey on vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(1), 87-110.
  • [10] Patel, K. R., Jain, L., & Patel, A. G. (2015). Image inpainting, a review of the underlying different algorithms and comparative study of the inpainting techniques. International Journal of Computer Applications, 118(10), 32-38.
  • [11] Pushpalwar, R. T., & Bhandari, S. H. (2016, February). Image inpainting approaches-a review. In 2016 IEEE 6th International Conference on Advanced Computing (IACC) (pp. 340-345). IEEE.
  • [12] Ahire, B. A., & Deshpande, N. A. (2018, August). Image Inpainting Techniques Applicable To Depth Image Based Rendering: A Review. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (pp. 1-5). IEEE.
  • [13] Elharrouss, O., Almaadeed, N., Al-Maadeed, S., & Akbari, Y. (2020). Image inpainting: A review. Neural Processing Letters, 51, 2007-2028.
  • [14] Wang, L., Chen, W., Yang, W., Bi, F., & Yu, F. R. (2020). A state-of-the-art review on image synthesis with generative adversarial networks. IEEE Access, 8, 63514-63537.
  • [15] Jam, J., Kendrick, C., Walker, K., Drouard, V., Hsu, J. G. S., & Yap, M. H. (2021). A comprehensive review of past and present image inpainting methods. Computer vision and image understanding, 203, 103147.
  • [16] Yap, M. H., Batool, N., Ng, C. C., Rogers, M., & Walker, K. (2021). A survey on facial wrinkles detection and inpainting: datasets, methods, and challenges. IEEE Transactions on Emerging Topics in Computational Intelligence, 5(4), 505-519.
  • [17] Liu, K., Li, J., & Hussain Bukhari, S. S. (2022). Overview of image inpainting and forensic technology. Security and Communication Networks, 2022.
  • [18] Zhu, X., Lu, J., Ren, H., Wang, H., & Sun, B. (2023). A transformer–CNN for deep image inpainting forensics. The Visual Computer, 39(10), 4721-4735.
  • [19] Weng, Y., Ding, S., & Zhou, T. (2022, January). A survey on improved GAN based image inpainting. In 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE) (pp. 319-322). IEEE.
  • [20] He, L., Qiang, Z., Wu, Y., & Wang, M. (2022, December). Survey on measures of image similarity applied in image inpainting base on deep learning. In Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence (pp. 1-6).
  • [21] Xiang, H., Zou, Q., Nawaz, M. A., Huang, X., Zhang, F., & Yu, H. (2023). Deep learning for image inpainting: A survey. Pattern Recognition, 134, 109046.
  • [22] Zhang, X., Zhai, D., Li, T., Zhou, Y., & Lin, Y. (2023). Image inpainting based on deep learning: A review. Information Fusion, 90, 74-94.
  • [23] Deng, Y., Hui, S., Zhou, S., Meng, D., & Wang, J. (2021, October). Learning contextual transformer network for image inpainting. In Proceedings of the 29th ACM international conference on multimedia (pp. 2529-2538).
  • [24] Wang, L., Zhang, S., Gu, L., Zhang, J., Zhai, X., Sha, X., & Chang, S. (2021). Automatic consecutive context perceived transformer GAN for serial sectioning image blind inpainting. Computers in Biology and Medicine, 136, 104751.
  • [25] Wan, Z., Zhang, J., Chen, D., & Liao, J. (2021). High-fidelity pluralistic image completion with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4692-4701).
  • [26] Li, W., Lin, Z., Zhou, K., Qi, L., Wang, Y., & Jia, J. (2022). Mat: Mask-aware transformer for large hole image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10758-10768).
  • [27] Dong, Q., Cao, C., & Fu, Y. (2022). Incremental transformer structure enhanced image inpainting with masking positional encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11358-11368).
  • [28] Yu, Y., Zhan, F., Wu, R., Pan, J., Cui, K., Lu, S., … & Miao, C. (2021, October). Diverse image inpainting with bidirectional and autoregressive transformers. In Proceedings of the 29th ACM International Conference on Multimedia (pp. 69-78).
  • [29] Huang, M., & Zhang, L. (2022, October). Atrous pyramid transformer with spectral convolution for image inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 4674-4683).
  • [30] Deng, Y., Hui, S., Zhou, S., Meng, D., & Wang, J. (2022, October). T-former: an efficient transformer for image inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 6559-6568).
  • [31] Zeng, Y., Fu, J., Chao, H., & Guo, B. (2022). Aggregated contextual transformations for high-resolution image inpainting. IEEE Transactions on Visualization and Computer Graphics.
  • [32] Liu, Q., Tan, Z., Chen, D., Chu, Q., Dai, X., Chen, Y., … & Yu, N. (2022). Reduce information loss in transformers for pluralistic image inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11347-11357).
  • [33] Hosen, M. I., & Islam, M. B. (2022). Himfr: A hybrid masked face recognition through face inpainting. arXiv preprint arXiv:2209.08930.
  • [34] Wang, M., Lu, W., Lyu, J., Shi, K., & Zhao, H. (2022). Generative image inpainting with enhanced gated convolution and Transformers. Displays, 75, 102321.
  • [35] Zhao, H., Gu, Z., Zheng, B., & Zheng, H. (2022, October). Transcnn-hae: Transformer-cnn hybrid autoencoder for blind image inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 6813-6821).
  • [36] Kim, S., Baek, J., Park, J., Kim, G., & Kim, S. (2022). Instaformer: Instance-aware image-to-image translation with transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18321-18331).
  • [37] Campana, J. L. F., Decker, L. G. L., e Souza, M. R., de Almeida Maia, H., & Pedrini, H. (2023). Variable-hyperparameter visual transformer for efficient image inpainting. Computers & Graphics, 113, 57-68.
  • [38] Zhang, W., Wang, Y., Ni, B., & Yang, X. (2023). Fully context-aware image inpainting with a learned semantic pyramid. Pattern Recognition, 143, 109741.
  • [39] Li, H., Song, Y., Li, H., & Wang, Z. (2023). Semantic prior-driven fused contextual transformation network for image inpainting. Journal of Visual Communication and Image Representation, 91, 103777.
  • [40] Ma, X., Zhou, X., Huang, H., Jia, G., Wang, Y., Chen, X., & Chen, C. (2023). Uncertainty-aware image inpainting with adaptive feedback network. Expert Systems with Applications, 121148.
  • [41] Zhou, M., Liu, X., Yi, T., Bai, Z., & Zhang, P. (2023). A superior image inpainting scheme using Transformer-based self-supervised attention GAN model. Expert Systems with Applications, 233, 120906.
  • [42] Chen, B. W., Liu, T. J., & Liu, K. H. (2023, September). Lightweight Image Inpainting By Stripe Window Transformer With Joint Attention To CNN. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.
  • [43] Cao, C., Dong, Q., & Fu, Y. (2023). ZITS++: image inpainting by improving the incremental transformer on structural priors. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [44] Jin, Y., Wu, J., Wang, W., Yan, Y., Jiang, J., & Zheng, J. (2023). Cascading Blend Network for Image Inpainting. ACM Transactions on Multimedia Computing, Communications and Applications, 20(1), 1-21.
  • [45] Liao, L., Liu, T., Chen, D., Xiao, J., Wang, Z., & Lin, C. W. (2023). TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting. arXiv preprint arXiv:2306.11528.
  • [46] Naderi, M., Givkashi, M., Karimi, N., Shirani, S., & Samavi, S. (2023). SFI-Swin: Symmetric Face Inpainting with Swin Transformer by Distinctly Learning Face Components Distributions. arXiv preprint arXiv:2301.03130.
  • [47] Liu, W., Cun, X., Pun, C. M., Xia, M., Zhang, Y., & Wang, J. (2023, June). Coordfill: Efficient high-resolution image inpainting via parameterized coordinate querying. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 37, No. 2, pp. 1746-1754).
  • [48] Xiang, H., Min, W., Wei, Z., Zhu, M., Liu, M., & Deng, Z. (2024). Image inpainting network based on multi-level attention mechanism. IET Image Processing, 18(2), 428-438.
  • [49] Chen, M., Liu, T., Xiong, X., Duan, Z., & Cui, A. (2023). A transformer-based cross-window aggregated attentional image inpainting model. Electronics, 12(12), 2726.
  • [50] Chu, T., Chen, J., Sun, J., Lian, S., Wang, Z., Zuo, Z., … & Lu, D. (2023). Rethinking fast fourier convolution in image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 23195-23205).
  • [51] Motamed, S., Xu, J., Wu, C. H., Häne, C., Bazin, J. C., & De la Torre, F. (2023). Patmat: Person aware tuning of mask-aware transformer for face inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 22778-22787).
  • [52] Ko, K., & Kim, C. S. (2023). Continuously masked transformer for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13169-13178).
  • [53] Shamsolmoali, P., Zareapoor, M., & Granger, E. (2023). Transinpaint: Transformer-based image inpainting with context adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 849-858).
  • [54] Chen, Y., Xia, R., Yang, K., & Zou, K. (2023). GCAM: lightweight image inpainting via group convolution and attention mechanism. International Journal of Machine Learning and Cybernetics, 1-11.
  • [55] Phutke, S. S., & Murala, S. (2023). Nested deformable multi-head attention for facial image inpainting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 6078-6087).
  • [56] Phutke, S. S., Kulkarni, A., Vipparthi, S. K., & Murala, S. (2023). Blind image inpainting via omni-dimensional gated attention and wavelet queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1251-1260).
  • [57] Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., … & Li, H. (2021). Fuseformer: Fusing fine-grained information in transformers for video inpainting. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 14040-14049).
  • [58] Yu, B., Li, W., Li, X., Lu, J., & Zhou, J. (2021). Frequency-aware spatiotemporal transformers for video inpainting detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8188-8197).
  • [59] Liu, R., Deng, H., Huang, Y., Shi, X., Lu, L., Sun, W., … & Li, H. (2021). Decoupled spatial-temporal transformer for video inpainting. arXiv preprint arXiv:2104.06637.
  • [60] Li, Z., Lu, C. Z., Qin, J., Guo, C. L., & Cheng, M. M. (2022). Towards an end-to-end framework for flow-guided video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 17562-17571).
  • [61] Zhang, K., Fu, J., & Liu, D. (2022, October). Flow-guided transformer for video inpainting. In European Conference on Computer Vision (pp. 74-90). Cham: Springer Nature Switzerland.
  • [62] Cai, J., Li, C., Tao, X., Yuan, C., & Tai, Y. W. (2022, October). Devit: Deformed vision transformers in video inpainting. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 779-789).
  • [63] Ren, J., Zheng, Q., Zhao, Y., Xu, X., & Li, C. (2022). Dlformer: Discrete latent transformer for video inpainting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 3511-3520).
  • [64] Yu, Y., Fan, H., & Zhang, L. (2023). Deficiency-aware masked transformer for video inpainting. arXiv preprint arXiv:2307.08629.
  • [65] Zhang, K., Peng, J., Fu, J., & Liu, D. (2024). Exploiting optical flow guidance for transformer-based video inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [66] Liao, M., Lu, F., Zhou, D., Zhang, S., Li, W., & Yang, R. (2020). Dvi: Depth guided video inpainting for autonomous driving. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16 (pp. 1-17). Springer International Publishing.
  • [67] Li, G., Zhang, K., Su, Y., & Wang, J. (2023). Feature pre-inpainting enhanced transformer for video inpainting. Engineering Applications of Artificial Intelligence, 123, 106323.
  • [68] Zhou, S., Li, C., Chan, K. C., & Loy, C. C. (2023). ProPainter: Improving propagation and transformer for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10477-10486).
  • [69] Lee, E., Yoo, J., Yang, Y., Baik, S., & Kim, T. H. (2023). Semantic-Aware Dynamic Parameter for Video Inpainting Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12949-12958).
  • [70] Liu, R., & Zhu, Y. (2023). FSTT: Flow-Guided Spatial Temporal Transformer for Deep Video Inpainting. Electronics, 12(21), 4452.
  • [71] Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018). High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8798-8807).
  • [72] Wang, Y., Tao, X., Qi, X., Shen, X., & Jia, J. (2018). Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 331-340).
  • [73] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • [74] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision (pp. 694-711). Cham: Springer.
  • [75] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414-2423).
  • [76] Ruby, U., & Yendapalli, V. (2020). Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng, 9(10).
  • [77] Zhou, Z., Huang, H., & Fang, B. (2021). Application of weighted cross-entropy loss function in intrusion detection. Journal of Computers and Communications, 9(11), 1-21.
  • [78] He, X., & Yin, Y. (2021). Non-local and multi-scale mechanisms for image inpainting. Sensors, 21(9), 3281.
  • [79] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536-2544).
  • [80] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017). Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1452-1464.
  • [81] Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  • [82] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4401-4410).
  • [83] Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., & Huang, T. (2018). Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327.
  • [84] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 724-732).
  • [85] Elharrouss, O., Almaadeed, N., Al-Maadeed, S., Bouridane, A., & Beghdadi, A. (2021). A combined multiple action recognition and summarization for surveillance video sequences. Applied Intelligence, 51, 690-712.
  • [86] Al-Ali, A., Elharrouss, O., Qidwai, U., & Al-Maadeed, S. (2021). ANFIS-Net for automatic detection of COVID-19. Scientific Reports, 11(1), 17318.
  • [87] Riahi, A., Elharrouss, O., & Al-Maadeed, S. (2022). BEMD-3DCNN-based method for COVID-19 detection. Computers in biology and medicine, 142, 105188.
  • [88] Hu, X., Chu, L., Pei, J., Liu, W., & Bian, J. (2021). Model complexity of deep learning: A survey. Knowledge and Information Systems, 63, 2585-2619.
  • [89] Rice, L., Wong, E., & Kolter, Z. (2020, November). Overfitting in adversarially robust deep learning. In International conference on machine learning (pp. 8093-8104). PMLR.
  • [90] Farahani, A., Voghoei, S., Rasheed, K., & Arabnia, H. R. (2021). A brief review of domain adaptation. Advances in data science and information engineering: proceedings from ICDATA 2020 and IKE 2020, 877-894.