-
Resolution-robust Large Mask Inpainting with Fourier Convolutions
Authors:
Roman Suvorov,
Elizaveta Logacheva,
Anton Mashikhin,
Anastasia Remizova,
Arsenii Ashukha,
Aleksei Silvestrov,
Naejin Kong,
Harshith Goka,
Kiwoong Park,
Victor Lempitsky
Abstract:
Modern image inpainting systems, despite significant progress, often struggle with large missing areas, complex geometric structures, and high-resolution images. We find that one of the main reasons for this is the lack of an effective receptive field in both the inpainting network and the loss function. To alleviate this issue, we propose a new method called large mask inpainting (LaMa). LaMa is based on i) a new inpainting network architecture that uses fast Fourier convolutions (FFCs), which have an image-wide receptive field; ii) a high receptive field perceptual loss; iii) large training masks, which unlock the potential of the first two components. Our inpainting network improves the state-of-the-art across a range of datasets and achieves excellent performance even in challenging scenarios, e.g. completion of periodic structures. Our model generalizes surprisingly well to resolutions that are higher than those seen at train time, and achieves this at lower parameter and time costs than the competitive baselines. The code is available at https://github.com/saic-mdal/lama.
Submitted 10 November, 2021; v1 submitted 15 September, 2021;
originally announced September 2021.
-
DeepLandscape: Adversarial Modeling of Landscape Video
Authors:
Elizaveta Logacheva,
Roman Suvorov,
Oleg Khomenko,
Anton Mashikhin,
Victor Lempitsky
Abstract:
We build a new model of landscape videos that can be trained on a mixture of static landscape images as well as landscape animations. Our architecture extends the StyleGAN model by augmenting it with parts that allow it to model dynamic changes in a scene. Once trained, our model can be used to generate realistic time-lapse landscape videos with moving objects and time-of-day changes. Furthermore, by fitting the learned models to a static landscape image, the latter can be reenacted in a realistic way. We propose simple but necessary modifications to the StyleGAN inversion procedure, which lead to in-domain latent codes and allow real images to be manipulated. Quantitative comparisons and user studies suggest that our model produces more compelling animations of given photographs than previously proposed methods. The results of our approach, including comparisons with prior art, can be seen in the supplementary materials and on the project page https://saic-mdal.github.io/deep-landscape
Submitted 21 August, 2020;
originally announced August 2020.
-
AI2D-RST: A multimodal corpus of 1000 primary school science diagrams
Authors:
Tuomo Hiippala,
Malihe Alikhani,
Jonas Haverinen,
Timo Kalliokoski,
Evanfiya Logacheva,
Serafina Orekhova,
Aino Tuomainen,
Matthew Stone,
John A. Bateman
Abstract:
This article introduces AI2D-RST, a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowd-sourced descriptions, which was originally developed to support research on automatic diagram understanding and visual question answering. Building on the segmentation of diagram layouts in AI2D, the AI2D-RST corpus presents a new multi-layer annotation schema that provides a rich description of their multimodal structure. Annotated by trained experts, the layers describe (1) the grouping of diagram elements into perceptual units, (2) the connections set up by diagrammatic elements such as arrows and lines, and (3) the discourse relations between diagram elements, which are described using Rhetorical Structure Theory (RST). Each annotation layer in AI2D-RST is represented using a graph. The corpus is freely available for research and teaching.
Submitted 20 March, 2020; v1 submitted 9 December, 2019;
originally announced December 2019.
-
Learning State Representations in Complex Systems with Multimodal Data
Authors:
Pavel Solovev,
Vladimir Aliev,
Pavel Ostyakov,
Gleb Sterkin,
Elizaveta Logacheva,
Stepan Troeshestov,
Roman Suvorov,
Anton Mashikhin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.
Submitted 15 January, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint
Authors:
Pavel Ostyakov,
Roman Suvorov,
Elizaveta Logacheva,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
We present a novel approach to image manipulation and understanding by simultaneously learning to segment object masks, paste objects onto another background image, and remove them from the original images. For this purpose, we develop a novel generative model for compositional image generation, SEIGAN (Segment-Enhance-Inpaint Generative Adversarial Network), which learns these three operations together in an adversarial architecture with additional cycle consistency losses. To train, SEIGAN needs only bounding box supervision and does not require pairing or ground truth masks. SEIGAN produces better generated images (evaluated by human assessors) than other approaches and produces high-quality segmentation masks, improving over other adversarially trained approaches and getting closer to the results of fully supervised training.
Submitted 15 January, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Label Denoising with Large Ensembles of Heterogeneous Neural Networks
Authors:
Pavel Ostyakov,
Elizaveta Logacheva,
Roman Suvorov,
Vladimir Aliev,
Gleb Sterkin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Despite recent advances in computer vision based on various convolutional architectures, video understanding remains an important challenge. In this work, we present and discuss a top solution for the large-scale video classification (labeling) problem introduced as a Kaggle competition based on the YouTube-8M dataset. We show and compare different approaches to preprocessing, data augmentation, model architectures, and model combination. Our final model is based on a large ensemble of video- and frame-level models but fits into rather limiting hardware constraints. We apply an approach based on knowledge distillation to deal with noisy labels in the original dataset and the recently developed mixup technique to improve the basic models.
Submitted 15 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.