-
A3D: Does Diffusion Dream about 3D Alignment?
Authors:
Savva Ignatyev,
Nina Konovalova,
Daniil Selikhanovych,
Nikolay Patakin,
Oleg Voynov,
Dmitry Senushkin,
Alexander Filippov,
Anton Konushin,
Peter Wonka,
Evgeny Burnaev
Abstract:
We tackle the problem of text-driven 3D generation from a geometry alignment perspective. We aim at the generation of multiple objects which are consistent in terms of semantics and geometry. Recent methods based on Score Distillation have succeeded in distilling the knowledge from 2D diffusion models to high-quality objects represented by 3D neural radiance fields. These methods handle multiple text queries separately, and therefore, the resulting objects have a high variability in object pose and structure. However, in some applications such as geometry editing, it is desirable to obtain aligned objects. In order to achieve alignment, we propose to optimize the continuous trajectories between the aligned objects, by modeling a space of linear pairwise interpolations of the textual embeddings with a single NeRF representation. We demonstrate that similar objects, consisting of semantically corresponding parts, can be well aligned in 3D space without costly modifications to the generation process. We provide several practical scenarios including mesh editing and object hybridization that benefit from geometry alignment and experimentally demonstrate the efficiency of our method. https://voyleg.github.io/a3d/
Submitted 21 June, 2024;
originally announced June 2024.
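The A3D abstract above describes modeling a space of linear pairwise interpolations of the textual embeddings. The sampling step alone might look like the minimal sketch below. It assumes NumPy, uses random vectors in place of real text-encoder outputs, and the function name is hypothetical; it is not the authors' implementation, and the NeRF and Score Distillation parts are omitted.

```python
import numpy as np

def sample_interpolated_embedding(embeddings, rng):
    """Pick a random pair of prompt embeddings and blend them linearly.

    Hypothetical helper: `embeddings` is an (n_prompts, dim) array.
    """
    n = len(embeddings)
    i, j = rng.choice(n, size=2, replace=False)  # two distinct prompts
    t = rng.uniform(0.0, 1.0)                    # position along the i -> j segment
    mixed = (1.0 - t) * embeddings[i] + t * embeddings[j]
    return mixed, (int(i), int(j), float(t))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 8))  # placeholder for real text embeddings
mixed, (i, j, t) = sample_interpolated_embedding(embeddings, rng)
```

In such a setup, `mixed` would condition a single shared 3D representation, so that `t = 0` and `t = 1` correspond to the two original prompts.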
-
Features Fusion for Dual-View Mammography Mass Detection
Authors:
Arina Varlamova,
Valery Belotsky,
Grigory Novikov,
Anton Konushin,
Evgeny Sidorov
Abstract:
Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which allows the processing of both mammography views simultaneously by sharing information not only on an object level, as seen in existing works, but also on a feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while keeping high recall. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as lesion annotation on pixel-level and classification of lesions malignancy.
Submitted 25 April, 2024;
originally announced April 2024.
-
TETRIS: Towards Exploring the Robustness of Interactive Segmentation
Authors:
Andrey Moskalenko,
Vlad Shakhuro,
Anna Vorontsova,
Anton Konushin,
Anton Antonov,
Alexander Krapukhin,
Denis Shepelev,
Konstantin Soshin
Abstract:
Interactive segmentation methods rely on user inputs to iteratively update the selection mask. A click specifying the object of interest is arguably the most simple and intuitive interaction type, and thereby the most common choice for interactive segmentation. However, user clicking patterns in the interactive segmentation context remain unexplored. Accordingly, interactive segmentation evaluation strategies rely more on intuition and common sense rather than empirical studies (e.g., assuming that users tend to click in the center of the area with the largest error). In this work, we conduct a real user study to investigate real user clicking patterns. This study reveals that the intuitive assumption made in the common evaluation strategy may not hold. As a result, interactive segmentation models may show high scores in the standard benchmarks, but it does not imply that they would perform well in a real world scenario. To assess the applicability of interactive segmentation methods, we propose a novel evaluation strategy providing a more comprehensive analysis of a model's performance. To this end, we propose a methodology for finding extreme user inputs by a direct optimization in a white-box adversarial attack on the interactive segmentation model. Based on the performance with such adversarial user inputs, we assess the robustness of interactive segmentation models w.r.t click positions. Besides, we introduce a novel benchmark for measuring the robustness of interactive segmentation, and report the results of an extensive evaluation of dozens of models.
Submitted 8 February, 2024;
originally announced February 2024.
-
OneFormer3D: One Transformer for Unified Point Cloud Segmentation
Authors:
Maxim Kolodiazhnyi,
Anna Vorontsova,
Anton Konushin,
Danila Rukhovich
Abstract:
Semantic, instance, and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design. Thereby, the similarity of all segmentation tasks and the implicit relationship between them have not been utilized effectively. This paper presents a unified, simple, and effective model addressing all these tasks jointly. The model, named OneFormer3D, performs instance and semantic segmentation consistently, using a group of learnable kernels, where each kernel is responsible for generating a mask for either an instance or a semantic category. These kernels are trained with a transformer-based decoder with unified instance and semantic queries passed as an input. Such a design enables training a model end-to-end in a single run, so that it achieves top performance on all three segmentation tasks simultaneously. Specifically, our OneFormer3D ranks 1st and sets a new state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also demonstrate the state-of-the-art results in semantic, instance, and panoptic segmentation of ScanNet (+21 PQ), ScanNet200 (+3.8 mAP50), and S3DIS (+0.8 mIoU) datasets.
Submitted 24 November, 2023;
originally announced November 2023.
-
Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data
Authors:
Nikolay Patakin,
Mikhail Romanov,
Anna Vorontsova,
Mikhail Artemyev,
Anton Konushin
Abstract:
Nowadays, robotics, AR, and 3D modeling applications attract considerable attention to single-view depth estimation (SVDE) as it allows estimating scene geometry from a single RGB image. Recent works have demonstrated that the accuracy of an SVDE method hugely depends on the diversity and volume of the training data. However, RGB-D datasets obtained via depth capturing or 3D reconstruction are typically small, synthetic datasets are not photorealistic enough, and all these datasets lack diversity. The large-scale and diverse data can be sourced from stereo images or stereo videos from the web. Typically being uncalibrated, stereo data provides disparities up to unknown shift (geometrically incomplete data), so stereo-trained SVDE methods cannot recover 3D geometry. It was recently shown that the distorted point clouds obtained with a stereo-trained SVDE method can be corrected with additional point cloud modules (PCM) separately trained on the geometrically complete data. On the contrary, we propose GP$^{2}$, General-Purpose and Geometry-Preserving training scheme, and show that conventional SVDE models can learn correct shifts themselves without any post-processing, benefiting from using stereo data even in the geometry-preserving setting. Through experiments on different dataset mixtures, we prove that GP$^{2}$-trained models outperform methods relying on PCM in both accuracy and speed, and report the state-of-the-art results in the general-purpose geometry-preserving SVDE. Moreover, we show that SVDE models can learn to predict geometrically correct depth even when geometrically complete data comprises the minor part of the training set.
Submitted 5 June, 2023;
originally announced June 2023.
-
Independent Component Alignment for Multi-Task Learning
Authors:
Dmitry Senushkin,
Nikolay Patakin,
Arseny Kuznetsov,
Anton Konushin
Abstract:
In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly. Despite rapid progress in the field, MTL remains challenging due to optimization issues such as conflicting and dominating gradients. In this work, we propose using a condition number of a linear system of gradients as a stability criterion of an MTL optimization. We theoretically demonstrate that a condition number reflects the aforementioned optimization issues. Accordingly, we present Aligned-MTL, a novel MTL optimization approach based on the proposed criterion, that eliminates instability in the training process by aligning the orthogonal components of the linear system of gradients. While many recent MTL approaches guarantee convergence to a minimum, task trade-offs cannot be specified in advance. In contrast, Aligned-MTL provably converges to an optimal point with pre-defined task-specific weights, which provides more control over the optimization result. Through experiments, we show that the proposed approach consistently improves performance on a diverse set of MTL benchmarks, including semantic and instance segmentation, depth estimation, surface normal estimation, and reinforcement learning. The source code is publicly available at https://github.com/SamsungLabs/MTL .
Submitted 30 May, 2023;
originally announced May 2023.
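The stability criterion named in the Aligned-MTL abstract above, a condition number of a linear system of gradients, can be illustrated with the minimal sketch below. It assumes NumPy and assumes that per-task gradients are stacked as the columns of a matrix; the function name and toy gradients are hypothetical. It only computes the criterion and does not reproduce the Aligned-MTL alignment step.

```python
import numpy as np

def gradient_condition_number(task_grads):
    """Ratio of the largest to the smallest singular value of the gradient matrix.

    Hypothetical helper: each task gradient is flattened into one column.
    """
    G = np.stack([np.ravel(g) for g in task_grads], axis=1)  # (n_params, n_tasks)
    s = np.linalg.svd(G, compute_uv=False)                   # sorted, largest first
    return float(s[0] / s[-1]) if s[-1] > 0 else float("inf")

# Toy two-parameter, two-task cases (made-up numbers for illustration only).
balanced = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]       # orthogonal, equal norms
dominated = [np.array([100.0, 0.0]), np.array([0.0, 1.0])]    # one task dominates
conflicting = [np.array([1.0, 0.0]), np.array([-0.99, 0.1])]  # nearly opposite directions

kappa_balanced = gradient_condition_number(balanced)
kappa_dominated = gradient_condition_number(dominated)
kappa_conflicting = gradient_condition_number(conflicting)
```

Under these assumptions the balanced case gives a condition number of 1, while both the dominating and the conflicting cases give much larger values, which matches the abstract's claim that the condition number reflects these two optimization issues.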
-
Contour-based Interactive Segmentation
Authors:
Danil Galeev,
Polina Popenova,
Anna Vorontsova,
Anton Konushin
Abstract:
Recent advances in interactive segmentation (IS) allow speeding up and simplifying image editing and labeling greatly. The majority of modern IS approaches accept user input in the form of clicks. However, using clicks may require too many user interactions, especially when selecting small objects, minor parts of an object, or a group of objects of the same type. In this paper, we consider such a natural form of user interaction as a loose contour, and introduce a contour-based IS method. We evaluate the proposed method on the standard segmentation benchmarks, our novel UserContours dataset, and its subset UserContours-G containing difficult segmentation cases. Through experiments, we demonstrate that a single contour provides the same accuracy as multiple clicks, thus reducing the required amount of user interactions.
Submitted 5 December, 2023; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Top-Down Beats Bottom-Up in 3D Instance Segmentation
Authors:
Maksim Kolodiazhnyi,
Anna Vorontsova,
Anton Konushin,
Danila Rukhovich
Abstract:
Most 3D instance segmentation methods exploit a bottom-up strategy, typically including resource-exhaustive post-processing. For point grouping, bottom-up methods rely on prior assumptions about the objects in the form of hyperparameters, which are domain-specific and need to be carefully tuned. On the contrary, we address 3D instance segmentation with a TD3D: the pioneering cluster-free, fully-convolutional and entirely data-driven approach trained in an end-to-end manner. This is the first top-down method outperforming bottom-up approaches in 3D domain. With its straightforward pipeline, it demonstrates outstanding accuracy and generalization ability on the standard indoor benchmarks: ScanNet v2, its extension ScanNet200, and S3DIS, as well as on the aerial STPLS3D dataset. Besides, our method is much faster on inference than the current state-of-the-art grouping-based approaches: our flagship modification is 1.9x faster than the most accurate bottom-up method, while being more accurate, and our faster modification shows state-of-the-art accuracy running at 2.6x speed. Code is available at https://github.com/SamsungLabs/td3d .
Submitted 11 September, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
TR3D: Towards Real-Time Indoor 3D Object Detection
Authors:
Danila Rukhovich,
Anna Vorontsova,
Anton Konushin
Abstract:
Recently, sparse 3D convolutions have changed 3D object detection. Performing on par with the voting-based approaches, 3D CNNs are memory-efficient and scale to large scenes better. However, there is still room for improvement. With a conscious, practice-oriented approach to problem-solving, we analyze the performance of such methods and localize the weaknesses. Applying modifications that resolve the found issues one by one, we end up with TR3D: a fast fully-convolutional 3D object detection model trained end-to-end, that achieves state-of-the-art results on the standard benchmarks, ScanNet v2, SUN RGB-D, and S3DIS. Moreover, to take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features. We employ our fusion module to make conventional 3D object detection methods multimodal and demonstrate an impressive boost in performance. Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset. Overall, besides being accurate, both TR3D and TR3D+FF models are lightweight, memory-efficient, and fast, thereby marking another milestone on the way toward real-time 3D object detection. Code is available at https://github.com/SamsungLabs/tr3d .
Submitted 5 December, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Floorplan-Aware Camera Poses Refinement
Authors:
Anna Sokolova,
Filipp Nikitin,
Anna Vorontsova,
Anton Konushin
Abstract:
Processing large indoor scenes is a challenging task, as scan registration and camera trajectory estimation methods accumulate errors across time. As a result, the quality of reconstructed scans is insufficient for some applications, such as visual-based localization and navigation, where the correct position of walls is crucial.
For many indoor scenes, there exists an image of a technical floorplan that contains information about the geometry and main structural elements of the scene, such as walls, partitions, and doors. We argue that such a floorplan is a useful source of spatial information, which can guide a 3D model optimization.
The standard RGB-D 3D reconstruction pipeline consists of a tracking module applied to an RGB-D sequence and a bundle adjustment (BA) module that takes the posed RGB-D sequence and corrects the camera poses to improve consistency. We propose a novel optimization algorithm expanding conventional BA that leverages the prior knowledge about the scene structure in the form of a floorplan. Our experiments on the Redwood dataset and our self-captured data demonstrate that utilizing floorplan improves accuracy of 3D reconstructions.
Submitted 10 October, 2022;
originally announced October 2022.
-
FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection
Authors:
Danila Rukhovich,
Anna Vorontsova,
Anton Konushin
Abstract:
Recently, promising applications in robotics and augmented reality have attracted considerable attention to 3D object detection from point clouds. In this paper, we present FCAF3D - a first-in-class fully convolutional anchor-free indoor 3D object detection method. It is a simple yet effective method that uses a voxel representation of a point cloud and processes voxels with sparse convolutions. FCAF3D can handle large-scale scenes with minimal runtime through a single fully convolutional feed-forward pass. Existing 3D object detection methods make prior assumptions on the geometry of objects, and we argue that it limits their generalization ability. To get rid of any prior assumptions, we propose a novel parametrization of oriented bounding boxes that allows obtaining better results in a purely data-driven way. The proposed method achieves state-of-the-art 3D object detection results in terms of mAP@0.5 on ScanNet V2 (+4.5), SUN RGB-D (+3.5), and S3DIS (+20.5) datasets. The code and models are available at https://github.com/samsunglabs/fcaf3d.
Submitted 24 March, 2022; v1 submitted 1 December, 2021;
originally announced December 2021.
-
ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection
Authors:
Danila Rukhovich,
Anna Vorontsova,
Anton Konushin
Abstract:
In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on monocular or multi-view RGB images. The number of monocular images in each multi-view input can vary during training and inference; in fact, this number might be unique for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection. The source code and the trained models are available at https://github.com/saic-vul/imvoxelnet.
Submitted 15 October, 2021; v1 submitted 2 June, 2021;
originally announced June 2021.
-
Reviving Iterative Training with Mask Guidance for Interactive Segmentation
Authors:
Konstantin Sofiiuk,
Ilia A. Petrov,
Anton Konushin
Abstract:
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes. These methods are considerably more computationally expensive compared to feedforward approaches, as they require performing backward passes through a network during inference and are hard to deploy on mobile frameworks that usually support only forward passes. In this paper, we extensively evaluate various design choices for interactive segmentation and discover that new state-of-the-art results can be obtained without any additional optimization schemes. Thus, we propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps. It allows not only to segment an entirely new object, but also to start with an external mask and correct it. When analyzing the performance of models trained on different datasets, we observe that the choice of a training dataset greatly impacts the quality of interactive segmentation. We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models. The code and trained models are available at https://github.com/saic-vul/ritm_interactive_segmentation.
Submitted 12 February, 2021;
originally announced February 2021.
-
Road images augmentation with synthetic traffic signs using neural networks
Authors:
Anton Konushin,
Boris Faizov,
Vlad Shakhuro
Abstract:
Traffic sign recognition is a well-researched problem in computer vision. However, state-of-the-art methods work only for frequent sign classes, which are well represented in training datasets. We consider the task of rare traffic sign detection and classification. We aim to solve that problem by using synthetic training data. Such training data is obtained by embedding synthetic images of signs in real photos. We propose three methods for making synthetic signs consistent with a scene in appearance. These methods are based on modern generative adversarial network (GAN) architectures. Our proposed methods allow realistic embedding of rare traffic sign classes that are absent in the training set. We adapt a variational autoencoder for sampling plausible locations of new traffic signs in images. We demonstrate that using a mixture of our synthetic data with real data improves the accuracy of both classifier and detector.
Submitted 13 January, 2021;
originally announced January 2021.
-
Towards General Purpose Geometry-Preserving Single-View Depth Estimation
Authors:
Mikhail Romanov,
Nikolay Patatkin,
Anna Vorontsova,
Sergey Nikolenko,
Anton Konushin,
Dmitry Senyushkin
Abstract:
Single-view depth estimation (SVDE) plays a crucial role in scene understanding for AR applications, 3D modeling, and robotics, providing the geometry of a scene based on a single image. Recent works have shown that a successful solution strongly relies on the diversity and volume of training data. This data can be sourced from stereo movies and photos. However, they do not provide geometrically complete depth maps (as disparities contain unknown shift value). Therefore, existing models trained on this data are not able to recover correct 3D representations. Our work shows that a model trained on this data along with conventional datasets can gain accuracy while predicting correct scene geometry. Surprisingly, only a small portion of geometrically correct depth maps are required to train a model that performs equally to a model trained on the full geometrically correct dataset. After that, we train computationally efficient models on a mixture of datasets using the proposed method. Through quantitative comparison on completely unseen datasets and qualitative comparison of 3D point clouds, we show that our model defines the new state of the art in general-purpose SVDE.
Submitted 9 February, 2021; v1 submitted 25 September, 2020;
originally announced September 2020.
-
Learning High-Resolution Domain-Specific Representations with a GAN Generator
Authors:
Danil Galeev,
Konstantin Sofiiuk,
Danila Rukhovich,
Mikhail Romanov,
Olga Barinova,
Anton Konushin
Abstract:
In recent years generative models of visual data have made great progress, and now they are able to produce images of high quality and diversity. In this work we study representations learnt by a GAN generator. First, we show that these representations can be easily projected onto a semantic segmentation map using a lightweight decoder. We find that such a semantic projection can be learnt from just a few annotated images. Based on this finding, we propose the LayerMatch scheme for approximating the representation of a GAN generator that can be used for unsupervised domain-specific pretraining. We consider the semi-supervised learning scenario when a small amount of labeled data is available along with a large unlabeled dataset from the same domain. We find that the use of a LayerMatch-pretrained backbone leads to superior accuracy compared to standard supervised pretraining on ImageNet. Moreover, this simple approach also outperforms recent semi-supervised semantic segmentation methods that use both labeled and unlabeled data during training. Source code for reproducing our experiments will be available at the time of publication.
Submitted 18 June, 2020;
originally announced June 2020.
-
Foreground-aware Semantic Representations for Image Harmonization
Authors:
Konstantin Sofiiuk,
Polina Popenova,
Anton Konushin
Abstract:
Image harmonization is an important step in photo editing to achieve visual consistency in composite images by adjusting the appearances of foreground to make it compatible with background. Previous approaches to harmonize composites are based on training of encoder-decoder networks from scratch, which makes it challenging for a neural network to learn a high-level representation of objects. We propose a novel architecture to utilize the space of high-level features learned by a pre-trained classification network. We create our models as a combination of existing encoder-decoder architectures and a pre-trained foreground-aware deep high-resolution network. We extensively evaluate the proposed method on existing image harmonization benchmark and set up a new state-of-the-art in terms of MSE and PSNR metrics. The code and trained models are available at https://github.com/saic-vul/image_harmonization.
Submitted 1 June, 2020;
originally announced June 2020.
-
Decoder Modulation for Indoor Depth Completion
Authors:
Dmitry Senushkin,
Mikhail Romanov,
Ilia Belikov,
Anton Konushin,
Nikolay Patakin
Abstract:
Depth completion recovers a dense depth map from sensor measurements. Current methods are mostly tailored for very sparse depth measurements from LiDARs in outdoor settings, while for indoor scenes Time-of-Flight (ToF) or structured light sensors are mostly used. These sensors provide semi-dense maps, with dense measurements in some regions and almost empty in others. We propose a new model that takes into account the statistical difference between such regions. Our main contribution is a new decoder modulation branch added to the encoder-decoder architecture. The encoder extracts features from the concatenated RGB image and raw depth. Given the mask of missing values as input, the proposed modulation branch controls the decoding of a dense depth map from these features differently for different regions. This is implemented by modifying the spatial distribution of output signals inside the decoder via Spatially-Adaptive Denormalization (SPADE) blocks. Our second contribution is a novel training strategy that allows us to train on a semi-dense sensor data when the ground truth depth map is not available. Our model achieves the state of the art results on indoor Matterport3D dataset. Being designed for semi-dense input depth, our model is still competitive with LiDAR-oriented approaches on the KITTI dataset. Our training strategy significantly improves prediction quality with no dense ground truth available, as validated on the NYUv2 dataset.
Submitted 8 February, 2021; v1 submitted 18 May, 2020;
originally announced May 2020.
-
IterDet: Iterative Scheme for Object Detection in Crowded Environments
Authors:
Danila Rukhovich,
Konstantin Sofiiuk,
Danil Galeev,
Olga Barinova,
Anton Konushin
Abstract:
Deep learning-based detectors usually produce a redundant set of object bounding boxes including many duplicate detections of the same object. These boxes are then filtered using non-maximum suppression (NMS) in order to select exactly one bounding box per object of interest. This greedy scheme is simple and provides sufficient accuracy for isolated objects but often fails in crowded environments, since one needs to both preserve boxes for different objects and suppress duplicate detections. In this work we develop an alternative iterative scheme, where a new subset of objects is detected at each iteration. Detected boxes from the previous iterations are passed to the network at the following iterations to ensure that the same object would not be detected twice. This iterative scheme can be applied to both one-stage and two-stage object detectors with just minor modifications of the training and inference procedures. We perform extensive experiments with two different baseline detectors on four datasets and show significant improvement over the baseline, leading to state-of-the-art performance on CrowdHuman and WiderPerson datasets. The source code and the trained models are available at https://github.com/saic-vul/iterdet.
Submitted 29 January, 2021; v1 submitted 12 May, 2020;
originally announced May 2020.
-
f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation
Authors:
Konstantin Sofiiuk,
Ilia Petrov,
Olga Barinova,
Anton Konushin
Abstract:
Deep neural networks have become a mainstream approach to interactive segmentation. As we show in our experiments, while for some images a trained network provides an accurate segmentation result with just a few clicks, for some unknown objects it cannot achieve a satisfactory result even with a large amount of user input. The recently proposed backpropagating refinement scheme (BRS) introduces an optimization problem for interactive segmentation that results in significantly better performance for the hard cases. At the same time, BRS requires running forward and backward passes through a deep network several times, which leads to a significantly increased computational budget per click compared to other methods. We propose f-BRS (feature backpropagating refinement scheme), which solves an optimization problem with respect to auxiliary variables instead of the network inputs and requires running forward and backward passes for just a small part of the network. Experiments on the GrabCut, Berkeley, DAVIS and SBD datasets set a new state of the art at an order of magnitude lower time per click compared to the original BRS. The code and trained models are available at https://github.com/saic-vul/fbrs_interactive_segmentation.
Submitted 25 August, 2020; v1 submitted 28 January, 2020;
originally announced January 2020.
-
Training Deep SLAM on Single Frames
Authors:
Igor Slinko,
Anna Vorontsova,
Dmitry Zhukov,
Olga Barinova,
Anton Konushin
Abstract:
Learning-based visual odometry and SLAM methods have demonstrated steady improvement over the past years. However, collecting ground truth poses to train these methods is difficult and expensive. This could be resolved by training in an unsupervised mode, but there is still a large gap between the performance of unsupervised and supervised methods. In this work, we focus on generating synthetic data for deep learning-based visual odometry and SLAM methods that take optical flow as an input. We produce training data in the form of optical flow that corresponds to arbitrary camera movement between a real frame and a virtual frame. For synthesizing data we use depth maps either produced by a depth sensor or estimated from a stereo pair. We train a visual odometry model on synthetic data and do not use ground truth poses; hence this model can be considered unsupervised. It can also be classified as monocular, as we do not use depth maps at inference. We also propose a simple way to convert any visual odometry model into a SLAM method based on frame matching and graph optimization. We demonstrate that both the synthetically-trained visual odometry model and the proposed SLAM method built upon this model yield state-of-the-art results among unsupervised methods on the KITTI dataset and show promising results on the challenging EuRoC dataset.
Submitted 11 December, 2019;
originally announced December 2019.
-
Measuring robustness of Visual SLAM
Authors:
David Prokhorov,
Dmitry Zhukov,
Olga Barinova,
Anna Vorontsova,
Anton Konushin
Abstract:
Simultaneous localization and mapping (SLAM) is an essential component of robotic systems. In this work we perform a feasibility study of RGB-D SLAM for the task of indoor robot navigation. Recent visual SLAM methods, e.g. ORBSLAM2 \cite{mur2017orb}, demonstrate impressive accuracy, but the experiments in the papers are usually conducted on just a few sequences, which makes it difficult to reason about the robustness of the methods. Another problem is that all available RGB-D datasets contain trajectories with very complex camera motions. In this work we extensively evaluate ORBSLAM2 to better understand the state of the art. First, we conduct experiments on the popular publicly available datasets for RGB-D SLAM across the conventional metrics. We perform statistical analysis of the results and find correlations between the metrics and the attributes of the trajectories. Then, we introduce a new large and diverse HomeRobot dataset where we model the motions of a simple home robot. Our dataset is created using physically-based rendering with realistic lighting and contains scenes composed by human designers. It includes thousands of sequences, which is two orders of magnitude more than in previous works. We find that while in many cases the accuracy of SLAM is very good, the robustness is still an issue.
Submitted 10 October, 2019;
originally announced October 2019.
-
DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping And Navigation
Authors:
Pavel Kirsanov,
Airat Gaskarov,
Filipp Konokhov,
Konstantin Sofiiuk,
Anna Vorontsova,
Igor Slinko,
Dmitry Zhukov,
Sergey Bykov,
Olga Barinova,
Anton Konushin
Abstract:
We present a novel dataset for training and benchmarking semantic SLAM methods. The dataset consists of 200 long sequences, each one containing 3000-5000 data frames. We generate the sequences using realistic home layouts. To do so, we sample trajectories that simulate the motions of a simple home robot, and then render the frames along the trajectories. Each data frame contains a) RGB images generated using physically-based rendering, b) simulated depth measurements, c) simulated IMU readings and d) a ground truth occupancy grid of a house. Our dataset serves a wider range of purposes compared to existing datasets and is the first large-scale benchmark focused on the mapping component of SLAM. The dataset is split into train/validation/test parts sampled from different sets of virtual houses. We present benchmarking results for both classical geometry-based and recent learning-based SLAM algorithms, a baseline mapping method, semantic segmentation and panoptic segmentation.
Submitted 26 September, 2019;
originally announced September 2019.
-
AdaptIS: Adaptive Instance Selection Network
Authors:
Konstantin Sofiiuk,
Olga Barinova,
Anton Konushin
Abstract:
We present the Adaptive Instance Selection (AdaptIS) network architecture for class-agnostic instance segmentation. Given an input image and a point $(x, y)$, it generates a mask for the object located at $(x, y)$. The network adapts to the input point with the help of AdaIN layers, thus producing different masks for different objects in the same image. AdaptIS generates pixel-accurate object masks, therefore it accurately segments objects of complex shape or severely occluded ones. AdaptIS can be easily combined with a standard semantic segmentation pipeline to perform panoptic segmentation. To illustrate the idea, we perform experiments on a challenging toy problem with difficult occlusions. Then we extensively evaluate the method on panoptic segmentation benchmarks. We obtain state-of-the-art results on Cityscapes and Mapillary even without pretraining on COCO, and show competitive results on the challenging COCO dataset. The source code of the method and the trained models are available at https://github.com/saic-vul/adaptis.
Submitted 17 September, 2019;
originally announced September 2019.
-
Perceptual Image Anomaly Detection
Authors:
Nina Tuluptceva,
Bart Bakker,
Irina Fedulova,
Anton Konushin
Abstract:
We present a novel method for image anomaly detection, where algorithms that use samples drawn from some distribution of "normal" data aim to detect out-of-distribution (abnormal) samples. Our approach includes a combination of encoder and generator for mapping an image distribution to a predefined latent distribution and vice versa. It leverages Generative Adversarial Networks to learn these data distributions and uses perceptual loss for the detection of image abnormality. To accomplish this goal, we first introduce a new similarity metric, which expresses the perceived similarity between images and is robust to changes in image contrast. Secondly, we introduce a novel approach for the selection of weights of a multi-objective loss function (image reconstruction and distribution mapping) in the absence of a validation dataset for hyperparameter tuning. After training, our model measures the abnormality of the input image as the perceptual dissimilarity between it and the closest generated image of the modeled data distribution. The proposed approach is extensively evaluated on several publicly available image benchmarks and achieves state-of-the-art performance.
Submitted 28 February, 2020; v1 submitted 12 September, 2019;
originally announced September 2019.
-
Scene Motion Decomposition for Learnable Visual Odometry
Authors:
Igor Slinko,
Anna Vorontsova,
Filipp Konokhov,
Olga Barinova,
Anton Konushin
Abstract:
Optical Flow (OF) and depth are commonly used for visual odometry since they provide sufficient information about camera ego-motion in a rigid scene. We reformulate the problem of ego-motion estimation as a problem of motion estimation of a 3D scene with respect to a static camera. The entire scene motion can be represented as a combination of the motions of its visible points. Using OF and depth we estimate the motion of each point in terms of 6DoF and represent the results in the form of motion maps, each one addressing a single degree of freedom. In this work we provide motion maps as inputs to a deep neural network that predicts the 6DoF of scene motion. Through our evaluation on outdoor and indoor datasets we show that utilizing motion maps leads to an accuracy improvement in comparison with naive stacking of depth and OF. Another contribution of our work is a novel network architecture that efficiently exploits motion maps and outperforms learnable RGB/RGB-D baselines.
Submitted 16 July, 2019;
originally announced July 2019.
-
Double Refinement Network for Efficient Indoor Monocular Depth Estimation
Authors:
Nikita Durasov,
Mikhail Romanov,
Valeriya Bubnova,
Pavel Bogomolov,
Anton Konushin
Abstract:
Monocular depth estimation is the task of obtaining a measure of distance for each pixel using a single image. It is an important problem in computer vision and is usually solved using neural networks. Though recent works in this area have shown significant improvement in accuracy, the state-of-the-art methods tend to require massive amounts of memory and time to process an image. The main purpose of this work is to improve the performance of the latest solutions with no decrease in accuracy. To this end, we introduce the Double Refinement Network architecture. The proposed method achieves state-of-the-art results on the standard benchmark RGB-D dataset NYU Depth v2, while its frames per second rate is significantly higher (up to 18 times speedup per image at batch size 1) and the RAM usage per image is lower.
Submitted 4 April, 2019; v1 submitted 20 November, 2018;
originally announced November 2018.
-
Pose-based Deep Gait Recognition
Authors:
Anna Sokolova,
Anton Konushin
Abstract:
Human gait or walking manner is a biometric feature that allows identification of a person when other biometric features such as the face or iris are not visible. In this paper, we present a new pose-based convolutional neural network model for gait recognition. Unlike many methods that consider the full-height silhouette of a moving person, we consider the motion of points in the areas around human joints. To extract motion information, we estimate the optical flow between consecutive frames. We propose a deep convolutional model that computes pose-based gait descriptors. We compare different network architectures and aggregation methods and experimentally assess various sets of body parts to determine which are the most important for gait recognition. In addition, we investigate the generalization ability of the developed algorithms by transferring them between datasets. The results of these experiments show that our approach outperforms state-of-the-art methods.
Submitted 8 February, 2018; v1 submitted 17 October, 2017;
originally announced October 2017.