
Grouped Discrete Representation Guides Object-Centric Learning

Authors: Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Abstract: Similar to humans perceiving visual scenes as objects, Object-Centric Learning (OCL) can abstract dense images or videos into sparse object-level features. Transformer-based OCL handles complex textures well due to the decoding guidance of discrete representation, obtained by discretizing noisy features in image or video feature maps using template features from a codebook. However, treating features as minimal units overlooks their composing attributes, thus impeding model generalization; indexing features with natural numbers loses attribute-level commonalities and characteristics, thus diminishing heuristics for model convergence. We propose *Grouped Discrete Representation* (GDR) to address these issues by grouping features into attributes and indexing them with tuple numbers. In extensive experiments across different query initializations, dataset modalities, and model architectures, GDR consistently improves convergence and generalizability. Visualizations show that our method effectively captures attribute-level information in features. The source code will be available upon acceptance.

Submitted 1 July, 2024; originally announced July 2024.

ACM Class: I.4.6
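One possible reading of "indexing them with tuple numbers" in the GDR abstract above can be sketched as follows. This is only an illustration, not the authors' implementation: the group sizes and the helper names `flat_to_tuple` and `tuple_to_flat` are hypothetical. It shows how a single natural-number code index can be rewritten as a tuple of per-attribute indices, so that two codes sharing an attribute also share a tuple component.

```python
import numpy as np

# Hypothetical setup: 2 attribute groups with 4 and 3 template values each,
# giving 4 * 3 = 12 combined codes. These sizes are illustrative only.
GROUP_SIZES = (4, 3)


def flat_to_tuple(index, group_sizes=GROUP_SIZES):
    """Convert a natural-number code index into a tuple of attribute indices."""
    return tuple(int(i) for i in np.unravel_index(index, group_sizes))


def tuple_to_flat(indices, group_sizes=GROUP_SIZES):
    """Convert a tuple of attribute indices back into a single code index."""
    return int(np.ravel_multi_index(indices, group_sizes))


code_a = flat_to_tuple(7)
code_b = flat_to_tuple(8)
# Codes 7 and 8 look unrelated as plain numbers, but their tuples share
# the first attribute index.
shares_first_attribute = code_a[0] == code_b[0]
```

With plain integer indices, the relation between codes 7 and 8 is invisible; the tuple form makes the shared attribute explicit.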

arXiv:2404.17324 [pdf, other]

Dense Road Surface Grip Map Prediction from Multimodal Image Data

Authors: Jyri Maanpää, Julius Pesonen, Heikki Hyyti, Iaroslav Melekhov, Juho Kannala, Petri Manninen, Antero Kukko, Juha Hyyppä

Abstract: Slippery road weather conditions are prevalent in many regions and cause a regular risk for traffic. Still, there has been less research on how autonomous vehicles could detect slippery driving conditions on the road to drive safely. In this work, we propose a method to predict a dense grip map from the area in front of the car, based on postprocessed multimodal sensor data. We trained a convolutional neural network to predict pixelwise grip values from fused RGB camera, thermal camera, and LiDAR reflectance images, based on weakly supervised ground truth from an optical road weather sensor. The experiments show that it is possible to predict dense grip values with good accuracy from the used data modalities as the produced grip map follows both ground truth measurements and local weather conditions, such as snowy areas on the road. The model using only the RGB camera or LiDAR reflectance modality provided good baseline results for grip prediction accuracy while using models fusing the RGB camera, thermal camera, and LiDAR modalities improved the grip predictions significantly.

Submitted 26 April, 2024; originally announced April 2024.

Comments: 17 pages, 7 figures (supplementary material 1 page, 1 figure). Submitted to 27th International Conference of Pattern Recognition (ICPR 2024)

arXiv:2403.17822 [pdf, other]

DN-Splatter: Depth and Normal Priors for Gaussian Splatting and Meshing

Authors: Matias Turkulainen, Xuqian Ren, Iaroslav Melekhov, Otto Seiskari, Esa Rahtu, Juho Kannala

Abstract: 3D Gaussian splatting, a novel differentiable rendering technique, has achieved state-of-the-art novel view synthesis results with high rendering speeds and relatively low training times. However, its performance on scenes commonly seen in indoor datasets is poor due to the lack of geometric constraints during optimization. We extend 3D Gaussian splatting with depth and normal cues to tackle challenging indoor datasets and showcase techniques for efficient mesh extraction, an important downstream application. Specifically, we regularize the optimization procedure with depth information, enforce local smoothness of nearby Gaussians, and use the geometry of the 3D Gaussians supervised by normal cues to achieve better alignment with the true scene geometry. We improve depth estimation and novel view synthesis results over baselines and show how this simple yet effective regularization technique can be used to directly extract meshes from the Gaussian representation yielding more physically accurate reconstructions on indoor scenes. Our code will be released in https://github.com/maturk/dn-splatter.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.13327 [pdf, other]

Gaussian Splatting on the Move: Blur and Rolling Shutter Compensation for Natural Camera Motion

Authors: Otto Seiskari, Jerry Ylilammi, Valtteri Kaatrasalo, Pekka Rantalankila, Matias Turkulainen, Juho Kannala, Arno Solin

Abstract: High-quality scene reconstruction and novel view synthesis based on Gaussian Splatting (3DGS) typically require steady, high-quality photographs, often impractical to capture with handheld cameras. We present a method that adapts to camera motion and allows high-quality scene reconstruction with handheld video data suffering from motion blur and rolling shutter distortion. Our approach is based on detailed modelling of the physical image formation process and utilizes velocities estimated using visual-inertial odometry (VIO). Camera poses are considered non-static during the exposure time of a single image frame and camera poses are further optimized in the reconstruction process. We formulate a differentiable rendering pipeline that leverages screen space approximation to efficiently incorporate rolling-shutter and motion blur effects into the 3DGS framework. Our results with both synthetic and real data demonstrate superior performance in mitigating camera motion over existing methods, thereby advancing 3DGS in naturalistic settings.

Submitted 24 May, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: Source code available at https://github.com/SpectacularAI/3dgs-deblur

arXiv:2311.02778 [pdf, other]

MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis

Authors: Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, Esa Rahtu

Abstract: Metaverse technologies demand accurate, real-time, and immersive modeling on consumer-grade hardware for both non-human perception (e.g., drone/robot/autonomous car navigation) and immersive technologies like AR/VR, requiring both structural accuracy and photorealism. However, there exists a knowledge gap in how to apply geometric reconstruction and photorealism modeling (novel view synthesis) in a unified framework. To address this gap and promote the development of robust and immersive modeling and rendering with consumer-grade devices, we propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset presents exciting challenges and requires state-of-the-art methods to be cost-effective, robust to noisy data and devices, and can jointly learn 3D reconstruction and novel view synthesis instead of treating them as separate tasks, making them ideal for real-world applications. We benchmark several famous pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Our dataset and benchmark show great potential in promoting the improvements for fusing 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion. The dataset and code are available at the project website: https://xuqianren.github.io/publications/MuSHRoom/.

Submitted 19 March, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

arXiv:2311.01953 [pdf, other]

Optimistic Multi-Agent Policy Gradient

Authors: Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen

Abstract: *Relative overgeneralization* (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods although these methods produce state-of-the-art results. To address this gap, we propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis to show that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks including the *Multi-agent MuJoCo* and *Overcooked* benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.

Submitted 25 May, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: Published at ICML 2024, 17 pages, 10 figures
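The advantage clipping described in the abstract above can be illustrated with a minimal sketch. It assumes the simplest reading of "clipping the advantage to eliminate negative values", namely replacing negative advantage estimates with zero before the policy-gradient update; the function name `optimistic_advantage` is hypothetical and the paper's exact formulation may differ.

```python
import numpy as np


def optimistic_advantage(advantages):
    """Replace negative advantage estimates with zero (assumed clipping rule)."""
    return np.maximum(np.asarray(advantages, dtype=float), 0.0)


# Example advantage estimates for four sampled actions.
adv = np.array([-1.5, 0.0, 2.0, -0.2])
clipped = optimistic_advantage(adv)
```

Under this reading, actions with negative estimated advantage contribute no gradient, so an agent is not pushed away from an action merely because teammates behaved poorly while it was tried.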

arXiv:2310.15128 [pdf, other]

Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients

Authors: Maximilian Krahn, Michelle Sasdelli, Fengyi Yang, Vladislav Golyanik, Juho Kannala, Tat-Jun Chin, Tolga Birdal

Abstract: We present QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.

Submitted 23 October, 2023; originally announced October 2023.

arXiv:2306.12547 [pdf, other]

DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching

Authors: Shuzhe Wang, Juho Kannala, Daniel Barath

Abstract: Matching 2D keypoints in an image to a sparse 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its low memory requirements, inherent privacy preservation, and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. However, existing algorithms often compromise on performance, resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper, we introduce DGC-GNN, a novel algorithm that employs a global-to-local Graph Neural Network (GNN) that progressively exploits geometric and color cues to represent keypoints, thereby improving matching accuracy. Our procedure encodes both Euclidean and angular relations at a coarse level, forming the geometric embedding to guide the point matching. We evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not only doubles the accuracy of the state-of-the-art visual descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor-free methods.

Submitted 24 March, 2024; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: CVPR 2024

arXiv:2306.09466 [pdf, other]

Simplified Temporal Consistency Reinforcement Learning

Authors: Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, Joni Pajarinen

Abstract: Reinforcement learning is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and required computation. To improve sample efficiency, recent work focuses on model-based RL which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This applies when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1 times faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks, such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2305.03595 [pdf, other]

HSCNet++: Hierarchical Scene Coordinate Classification and Regression for Visual Localization with Transformer

Authors: Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Yi Zhao, Giorgos Tolias, Juho Kannala

Abstract: Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The proposed method, which is an extension of HSCNet, allows us to train compact models which scale robustly to large environments. It sets a new state-of-the-art for single-image localization on the 7-Scenes, 12 Scenes, Cambridge Landmarks datasets, and the combined indoor scenes.

Submitted 5 May, 2023; originally announced May 2023.

arXiv:2302.09825 [pdf, other]

TBPos: Dataset for Large-Scale Precision Visual Localization

Authors: Masud Fahim, Ilona Söchting, Luca Ferranti, Juho Kannala, Jani Boutellier

Abstract: Image based localization is a classical computer vision challenge, with several well-known datasets. Generally, datasets consist of a visual 3D database that captures the modeled scenery, as well as query images whose 3D pose is to be discovered. Usually the query images have been acquired with a camera that differs from the imaging hardware used to collect the 3D database; consequently, it is hard to acquire accurate ground truth poses between query images and the 3D database. As the accuracy of visual localization algorithms constantly improves, precise ground truth becomes increasingly important. This paper proposes TBPos, a novel large-scale visual dataset for image based positioning, which provides query images with fully accurate ground truth poses: both the database images and the query images have been derived from the same laser scanner data. In the experimental part of the paper, the proposed dataset is evaluated by means of an image-based localization pipeline.

Submitted 20 February, 2023; originally announced February 2023.

Comments: Scandinavian Conference on Image Analysis 2023

arXiv:2301.01057 [pdf, other]

BS3D: Building-scale 3D Reconstruction from RGB-D Images

Authors: Janne Mustaniemi, Juho Kannala, Esa Rahtu, Li Liu, Janne Heikkilä

Abstract: Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.

Submitted 3 January, 2023; originally announced January 2023.

arXiv:2212.13381 [pdf, other]

MixupE: Understanding and Improving Mixup from Directional Derivative Perspective

Authors: Yingtian Zou, Vikas Verma, Sarthak Mittal, Wai Hoh Tang, Hieu Pham, Juho Kannala, Yoshua Bengio, Arno Solin, Kenji Kawaguchi

Abstract: Mixup is a popular data augmentation technique for training deep neural networks where additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve the generalization performance in many learning paradigms and applications. In this work, we first analyze Mixup and show that it implicitly regularizes infinitely many directional derivatives of all orders. Based on this new insight, we propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup. To demonstrate the effectiveness of the proposed method, we conduct experiments across various domains such as images, tabular data, speech, and graphs. Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures, for instance, exhibiting an improvement over Mixup by 0.8% in ImageNet top-1 accuracy.

Submitted 15 October, 2023; v1 submitted 27 December, 2022; originally announced December 2022.

Comments: 16 pages, Best Student Paper Award at UAI 2023
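The vanilla Mixup operation that the abstract above describes, linearly interpolating pairs of inputs and their labels, can be sketched as follows. This illustrates only the baseline technique, not the improved MixupE method; the function name `mixup_pair` and the toy data are hypothetical, and the mixing coefficient `lam` is fixed here for clarity, whereas in practice it is commonly sampled from a Beta distribution.

```python
import numpy as np


def mixup_pair(x1, y1, x2, y2, lam):
    """Linearly interpolate two inputs and their one-hot labels with weight lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y


# Toy example: two 2-dimensional inputs with one-hot labels for two classes.
x1, y1 = np.array([1.0, 0.0]), np.array([1.0, 0.0])
x2, y2 = np.array([0.0, 1.0]), np.array([0.0, 1.0])

x_mix, y_mix = mixup_pair(x1, y1, x2, y2, lam=0.25)
```

The mixed label stays a valid probability vector because the two weights, `lam` and `1 - lam`, sum to one.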

arXiv:2211.15656 [pdf, other]

SuperFusion: Multilevel LiDAR-Camera Fusion for Long-Range HD Map Generation

Authors: Hao Dong, Xianjing Zhang, Jintao Xu, Rui Ai, Weihao Gu, Huimin Lu, Juho Kannala, Xieyuanli Chen

Abstract: High-definition (HD) semantic map generation of the environment is an essential component of autonomous driving. Existing methods have achieved good performance in this task by fusing different sensor modalities, such as LiDAR and camera. However, current works are based on raw data or network feature-level fusion and only consider short-range HD map generation, limiting their deployment to realistic autonomous driving applications. In this paper, we focus on the task of building the HD maps in both short ranges, i.e., within 30 m, and also predicting long-range HD maps up to 90 m, which is required by downstream path planning and control tasks to improve the smoothness and safety of autonomous driving. To this end, we propose a novel network named SuperFusion, exploiting the fusion of LiDAR and camera data at multiple levels. We use LiDAR depth to improve image depth estimation and use image features to guide long-range LiDAR feature prediction. We benchmark our SuperFusion on the nuScenes dataset and a self-recorded dataset and show that it outperforms the state-of-the-art baseline methods with large margins on all intervals. Additionally, we apply the generated HD map to a downstream path planning task, demonstrating that the long-range HD maps predicted by our method can lead to better path planning for autonomous vehicles. Our code and self-recorded dataset will be available at https://github.com/haomo-ai/SuperFusion.

Submitted 16 March, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2211.00392 [pdf, other]

Expansion of Visual Hints for Improved Generalization in Stereo Matching

Authors: Andrea Pilzer, Yuxin Hou, Niki Loppi, Arno Solin, Juho Kannala

Abstract: We introduce visual hints expansion for guiding stereo matching to improve generalization. Our work is motivated by the robustness of Visual Inertial Odometry (VIO) in computer vision and robotics, where a sparse and unevenly distributed set of feature points characterizes a scene. To improve stereo matching, we propose to elevate 2D hints to 3D points. These sparse and unevenly distributed 3D visual hints are expanded using a 3D random geometric graph, which enhances the learning and inference process. We evaluate our proposal on multiple widely adopted benchmarks and show improved performance without access to additional sensors other than the image sequence. To highlight practical applicability and symbiosis with visual odometry, we demonstrate how our methods run on embedded hardware.

Submitted 1 November, 2022; originally announced November 2022.

Comments: 2023 IEEE Winter Conference on Applications of Computer Vision (WACV)

arXiv:2210.13846 [pdf, other]

Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

Authors: Yi Zhao, Rinu Boney, Alexander Ilin, Juho Kannala, Joni Pajarinen

Abstract: Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behaviour cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available: https://github.com/zhaoyi11/adaptive_bc.

Submitted 25 October, 2022; originally announced October 2022.

arXiv:2210.01426 [pdf, other]

Continuous Monte Carlo Graph Search

Authors: Kalle Kujanpää, Amin Babadi, Yi Zhao, Juho Kannala, Alexander Ilin, Joni Pajarinen

Abstract: Online planning is crucial for high performance in many complex sequential decision-making tasks. Monte Carlo Tree Search (MCTS) employs a principled mechanism for trading off exploration for exploitation for efficient online planning, and it outperforms comparison methods in many discrete decision-making domains such as Go, Chess, and Shogi. Subsequently, extensions of MCTS to continuous domains have been developed. However, the inherent high branching factor and the resulting explosion of the search tree size are limiting the existing methods. To address this problem, we propose Continuous Monte Carlo Graph Search (CMCGS), an extension of MCTS to online planning in environments with continuous state and action spaces. CMCGS takes advantage of the insight that, during planning, sharing the same action policy between several states can yield high performance. To implement this idea, at each time step, CMCGS clusters similar states into a limited number of stochastic action bandit nodes, which produce a layered directed graph instead of an MCTS search tree. Experimental evaluation shows that CMCGS outperforms comparable planning methods in several complex continuous DeepMind Control Suite benchmarks and 2D navigation and exploration tasks with limited sample budgets. Furthermore, CMCGS can be scaled up through parallelization, and it outperforms the Cross-Entropy Method (CEM) in continuous control with learned dynamics models.

Submitted 7 February, 2024; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: Accepted at AAMAS 2024 (full paper & oral)

arXiv:2208.07591 [pdf, other]

Uncertainty-guided Source-free Domain Adaptation

Authors: Subhankar Roy, Martin Trapp, Andrea Pilzer, Juho Kannala, Nicu Sebe, Elisa Ricci, Arno Solin

Abstract: Source-free domain adaptation (SFDA) aims to adapt a classifier to an unlabelled target data set by only using a pre-trained source model. However, the absence of the source data and the domain shift make the predictions on the target data unreliable. We propose quantifying the uncertainty in the source model predictions and utilizing it to guide the target adaptation. For this, we construct a probabilistic source model by incorporating priors on the network parameters, inducing a distribution over the model predictions. Uncertainties are estimated by employing a Laplace approximation and incorporated to identify target data points that do not lie in the source manifold and to down-weight them when maximizing the mutual information on the target data. Unlike recent works, our probabilistic treatment is computationally lightweight, decouples source training and target adaptation, and requires no specialized source training or changes of the model architecture. We show the advantages of uncertainty-guided SFDA over traditional SFDA in the closed-set and open-set settings and provide empirical evidence that our approach is more robust to strong domain shifts even without tuning.

Submitted 16 August, 2022; originally announced August 2022.

Comments: ECCV 2022

arXiv:2208.06933 [pdf, other]

Visual Localization via Few-Shot Scene Region Classification

Authors: Siyan Dong, Shuzhe Wang, Yixin Zhuang, Juho Kannala, Marc Pollefeys, Baoquan Chen

Abstract: Visual (re)localization addresses the problem of estimating the 6-DoF (Degree of Freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications. Recent advances in structure-based localization solve this problem by memorizing the mapping from image pixels to scene coordinates with neural networks to build 2D-3D correspondences for camera pose optimization. However, such memorization requires training on large amounts of posed images in each scene, which is heavy and inefficient. In contrast, few-shot images are usually sufficient to cover the main regions of a scene for a human operator to perform visual localization. In this paper, we propose a scene region classification approach to achieve fast and effective scene memorization with few-shot images. Our insight is to leverage a) a pre-learned feature extractor, b) a scene region classifier, and c) a meta-learning strategy to accelerate training while mitigating overfitting. We evaluate our method on both indoor and outdoor benchmarks. The experiments validate the effectiveness of our method in the few-shot setting, and the training time is significantly reduced to only a few minutes. Code available at: https://github.com/siyandong/SRC

Submitted 14 August, 2022; originally announced August 2022.

Comments: 3DV 2022

arXiv:2206.08890 [pdf, other]

Disentangling Model Multiplicity in Deep Learning

Authors: Ari Heljakka, Martin Trapp, Juho Kannala, Arno Solin

Abstract: Model multiplicity is a well-known but poorly understood phenomenon that undermines the generalisation guarantees of machine learning models. It appears when two models with similar training-time performance differ in their predictions and real-world performance characteristics. This observed 'predictive' multiplicity (PM) also implies elusive differences in the internals of the models, their 'representational' multiplicity (RM). We introduce a conceptual and experimental setup for analysing RM by measuring activation similarity via singular vector canonical correlation analysis (SVCCA). We show that certain differences in training methods systematically result in larger RM than others and evaluate RM and PM over a finite sample as predictors for generalizability. We further correlate RM with PM measured by the variance in i.i.d. and out-of-distribution test predictions in four standard image data sets. Finally, instead of attempting to eliminate RM, we call for its systematic measurement and maximal exposure.

Submitted 31 January, 2023; v1 submitted 17 June, 2022; originally announced June 2022.

Comments: 13 pages, 6 figures

arXiv:2205.11299 [pdf, other]

Multiple Offsets Multilateration: a new paradigm for sensor network calibration with unsynchronized reference nodes

Authors: Luca Ferranti, Kalle Åström, Magnus Oskarsson, Jani Boutellier, Juho Kannala

Abstract: Positioning using wave signal measurements is used in several applications, such as GPS systems, structure from sound, and Wi-Fi-based positioning. Mathematically, such problems require the computation of the positions of receivers and/or transmitters as well as time offsets if the devices are unsynchronized. In this paper, we expand the previous state of the art on positioning formulations by introducing Multiple Offsets Multilateration (MOM), a new mathematical framework to compute the receivers' positions with pseudoranges from unsynchronized reference transmitters at known positions. This could be applied in several scenarios, for example structure from sound and positioning with LEO satellites. We mathematically describe MOM and determine how many receivers and transmitters are needed for the network to be solvable. We also present a study on the number of possible distinct solutions and derive stable solvers based on homotopy continuation. The solvers are shown to be efficient and robust to noise both for synthetic and real audio data.

Submitted 23 May, 2022; originally announced May 2022.

Comments: accepted to ICASSP2022

arXiv:2110.08407 [pdf, other]

doi 10.1007/978-3-030-88210-5_4

Bridging the gap between paired and unpaired medical image translation

Authors: Pauliina Paavilainen, Saad Ullah Akram, Juho Kannala

Abstract: Medical image translation has the potential to reduce the imaging workload, by removing the need to capture some sequences, and to reduce the annotation burden for developing machine learning methods. GANs have been used successfully to translate images from one domain to another, such as MR to CT. At present, paired data (registered MR and CT images) or extra supervision (e.g. segmentation masks) is needed to learn good translation models. Registering multiple modalities or annotating structures within each of them is a tedious and laborious task. Thus, there is a need to develop improved translation methods for unpaired data. Here, we introduce modified pix2pix models for the tasks CT$\rightarrow$MR and MR$\rightarrow$CT, trained with unpaired CT and MR data, and MRCAT pairs generated from the MR scans. The proposed modifications utilize the paired MR and MRCAT images to ensure good alignment between input and translated images, and unpaired CT images ensure that the MR$\rightarrow$CT model produces realistic-looking CT and that the CT$\rightarrow$MR model works well with real CT as input. The proposed pix2pix variants outperform baseline pix2pix, pix2pixHD and CycleGAN in terms of FID and KID, and generate more realistic-looking CT and MR translations.

Submitted 15 October, 2021; originally announced October 2021.

Comments: Deep Generative Models for MICCAI (DGM4MICCAI) workshop 2021

arXiv:2110.04773 [pdf, other]

Digging Into Self-Supervised Learning of Feature Descriptors

Authors: Iaroslav Melekhov, Zakaria Laskar, Xiaotian Li, Shuzhe Wang, Juho Kannala

Abstract: Fully-supervised CNN-based approaches for learning local image descriptors have shown remarkable results in a wide range of geometric tasks. However, most of them require per-pixel ground-truth keypoint correspondence data, which is difficult to acquire at scale. To address this challenge, recent weakly- and self-supervised methods can learn feature descriptors from relative camera poses or using only synthetic rigid transformations such as homographies. In this work, we focus on understanding the limitations of existing self-supervised approaches and propose a set of improvements that, combined, lead to powerful feature descriptors. We show that increasing the search space from in-pair to in-batch for hard negative mining brings consistent improvement. To enhance the discriminativeness of feature descriptors, we propose a coarse-to-fine method for mining local hard negatives from a wider search space by using global visual image descriptors. We demonstrate that a combination of synthetic homography transformation, color augmentation, and photorealistic image stylization produces useful representations that are viewpoint and illumination invariant. The feature descriptors learned by the proposed approach perform competitively and surpass their fully- and weakly-supervised counterparts on various geometric benchmarks such as image-based localization, sparse feature matching, and image retrieval.

Submitted 10 October, 2021; originally announced October 2021.

Comments: Camera ready (3DV 2021)

arXiv:2108.09112 [pdf, other]

Continual Learning for Image-Based Camera Localization

Authors: Shuzhe Wang, Zakaria Laskar, Iaroslav Melekhov, Xiaotian Li, Juho Kannala

Abstract: For several emerging technologies such as augmented reality, autonomous driving and robotics, visual localization is a critical component. Directly regressing camera pose/3D scene coordinates from the input image using deep neural networks has shown great potential. However, such methods assume a stationary data distribution with all scenes simultaneously available during training. In this paper, we approach the problem of visual localization in a continual learning setup, whereby the model is trained on scenes in an incremental manner. Our results show that, similar to the classification domain, non-stationary data induces catastrophic forgetting in deep networks for visual localization. To address this issue, a strong baseline based on storing and replaying images from a fixed buffer is proposed. Furthermore, we propose a new sampling method based on coverage score (Buff-CS) that adapts the existing sampling strategies in the buffering process to the problem of visual localization. Results demonstrate consistent improvements over standard buffering methods on two challenging datasets, 7Scenes and 12Scenes, and also on 19Scenes, obtained by combining the former scenes.

Submitted 27 April, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2106.11857 [pdf, other]

doi 10.1109/WACV51458.2022.00036

HybVIO: Pushing the Limits of Real-time Visual-inertial Odometry

Authors: Otto Seiskari, Pekka Rantalankila, Juho Kannala, Jerry Ylilammi, Esa Rahtu, Arno Solin

Abstract: We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM. The core of our method is highly robust, independent VIO with improved IMU bias modeling, outlier rejection, stationarity detection, and feature track selection, which is adjustable to run on embedded hardware. Long-term consistency is achieved with a loosely-coupled SLAM module. In academic benchmarks, our solution yields excellent performance in all categories, especially in the real-time use case, where we outperform the current state-of-the-art. We also demonstrate the feasibility of VIO for vehicular tracking on consumer-grade hardware using a custom dataset, and show good performance in comparison to current commercial VISLAM alternatives. An open-source implementation of the HybVIO method is available at https://github.com/SpectacularAI/HybVIO

Submitted 25 November, 2021; v1 submitted 22 June, 2021; originally announced June 2021.

Comments: 2022 IEEE Winter Conference on Applications of Computer Vision (WACV)

arXiv:2106.07995 [pdf, other]

Learning of feature points without additional supervision improves reinforcement learning from images

Authors: Rinu Boney, Alexander Ilin, Juho Kannala

Abstract: In many control problems that include vision, optimal controls can be inferred from the location of the objects in the scene. This information can be represented using feature points, that is, a list of spatial locations in learned feature maps of an input image. Previous works show that feature points learned using unsupervised pre-training or human supervision can provide good features for control tasks. In this paper, we show that it is possible to learn efficient feature point representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses. Our proposed architecture consists of a differentiable feature point extractor that feeds the coordinates of the estimated feature points directly to a soft actor-critic agent. The proposed algorithm yields performance competitive with the state of the art on DeepMind Control Suite tasks.

Submitted 4 June, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2104.03117 [pdf, other]

Single Source One Shot Reenactment using Weighted motion From Paired Feature Points

Authors: Soumya Tripathy, Juho Kannala, Esa Rahtu

Abstract: Image reenactment is a task where the target object in the source image imitates the motion represented in the driving image. One of the most common reenactment tasks is face image animation. The major challenge in current face reenactment approaches is to distinguish between facial motion and identity. For this reason, previous models struggle to produce high-quality animations if the driving and source identities are different (cross-person reenactment). We propose a new (face) reenactment model that learns shape-independent motion features in a self-supervised setup. The motion is represented using a set of paired feature points extracted from the source and driving images simultaneously. The model is generalised to multiple reenactment tasks including faces and non-face objects using only a single source image. Extensive experiments show that the model faithfully transfers the driving motion to the source while keeping the source identity intact.

Submitted 7 April, 2021; originally announced April 2021.

arXiv:2101.01619 [pdf, other]

Novel View Synthesis via Depth-guided Skip Connections

Authors: Yuxin Hou, Arno Solin, Juho Kannala

Abstract: We introduce a principled approach for synthesizing new views of a scene given a single source image. Previous methods for novel view synthesis can be divided into image-based rendering methods (e.g. flow prediction) or pixel generation methods. Flow predictions enable the target view to re-use pixels directly, but can easily lead to distorted results. Directly regressing pixels can produce structurally consistent results but generally suffers from a lack of low-level details. In this paper, we utilize an encoder-decoder architecture to regress pixels of a target view. In order to maintain details, we couple the decoder aligned feature maps with skip connections, where the alignment is guided by the predicted depth map of the target view. Our experimental results show that our method does not suffer from distortions and successfully preserves texture details with aligned skip connections.

Submitted 5 January, 2021; originally announced January 2021.

arXiv:2012.12186 [pdf, other]

Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

Authors: Rinu Boney, Alexander Ilin, Juho Kannala, Jarno Seppänen

Abstract: We consider learning to play multiplayer imperfect-information games with simultaneous moves and large state-action spaces. Previous attempts to tackle such challenging games have largely focused on model-free learning methods, often requiring hundreds of years of experience to produce competitive agents. Our approach is based on model-based planning. We tackle the problem of partial observability by first building an (oracle) planner that has access to the full state of the environment and then distilling the knowledge of the oracle to a (follower) agent which is trained to play the imperfect-information game by imitating the oracle's choices. We experimentally show that planning with naive Monte Carlo tree search does not perform very well in large combinatorial action spaces. We therefore propose planning with a fixed-depth tree search and decoupled Thompson sampling for action selection. We show that the planner is able to discover efficient playing strategies in the games of Clash Royale and Pommerman and the follower policy successfully learns to implement them by training on a few hundred battles.

Submitted 22 December, 2020; originally announced December 2020.

arXiv:2011.04439 [pdf, other]

FACEGAN: Facial Attribute Controllable rEenactment GAN

Authors: Soumya Tripathy, Juho Kannala, Esa Rahtu

Abstract: Face reenactment is a popular facial animation method where the person's identity is taken from the source image and the facial motion from the driving image. Recent works have demonstrated high-quality results by combining facial-landmark-based motion representations with generative adversarial networks. These models perform best if the source and driving images depict the same person or if the facial structures are otherwise very similar. However, if the identity differs, the driving facial structures leak to the output, distorting the reenactment result. We propose a novel Facial Attribute Controllable rEenactment GAN (FACEGAN), which transfers the facial motion from the driving face via the Action Unit (AU) representation. Unlike facial landmarks, the AUs are independent of the facial structure, preventing the identity leak. Moreover, AUs provide a human-interpretable way to control the reenactment. FACEGAN processes background and face regions separately for optimized output quality. Extensive quantitative and qualitative comparisons show a clear improvement over the state of the art in a single-source reenactment task. The results are best illustrated in the reenactment video provided in the supplementary material. The source code will be made available upon publication of the paper.

Submitted 9 November, 2020; originally announced November 2020.

Comments: Accepted to WACV-2021

arXiv:2011.03085 [pdf, other]

RealAnt: An Open-Source Low-Cost Quadruped for Education and Research in Real-World Reinforcement Learning

Authors: Rinu Boney, Jussi Sainio, Mikko Kaivola, Arno Solin, Juho Kannala

Abstract: Current robot platforms available for research are either very expensive or unable to handle the abuse of exploratory controls in reinforcement learning. We develop RealAnt, a minimal low-cost physical version of the popular 'Ant' benchmark used in reinforcement learning. RealAnt costs only approximately 350 EUR ($410) in materials and can be assembled in less than an hour. We validate the platform with reinforcement learning experiments and provide baseline results on a set of benchmark tasks. We demonstrate that the RealAnt robot can learn to walk from scratch from less than 10 minutes of experience. We also provide simulator versions of the robot (with the same dimensions, state-action spaces, and delayed noisy observations) in the MuJoCo and PyBullet simulators. We open-source hardware designs, supporting software, and baseline results for educational use and reproducible research.

Submitted 4 June, 2022; v1 submitted 5 November, 2020; originally announced November 2020.

arXiv:2010.09105 [pdf, other]

Movement-induced Priors for Deep Stereo

Authors: Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala, Arno Solin

Abstract: We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either works in a plug-and-play fashion with pre-trained deep stereo networks or can be further improved by jointly training the kernels together with encoder-decoder architectures, leading to consistent improvement.

Submitted 18 October, 2020; originally announced October 2020.

arXiv:2010.00347 [pdf, other]

Can You Trust Your Pose? Confidence Estimation in Visual Localization

Authors: Luca Ferranti, Xiaotian Li, Jani Boutellier, Juho Kannala

Abstract: Camera pose estimation in large-scale environments is still an open question and, despite recent promising results, it may still fail in some situations. The research so far has focused on improving subcomponents of estimation pipelines to achieve more accurate poses. However, there is no guarantee that the result is correct, even though the correctness of pose estimation is critically important in several visual localization applications, such as autonomous navigation. In this paper, we bring to attention a novel research question, pose confidence estimation, where we aim at quantifying how reliable the visually estimated pose is. We develop a novel confidence measure to fulfil this task and show that it can be flexibly applied to different datasets, indoor or outdoor, and to various visual localization pipelines. We also show that the proposed techniques can be used to accomplish a secondary goal: improving the accuracy of existing pose estimation pipelines. Finally, the proposed approach is computationally lightweight and adds only a negligible increase to the computational effort of pose estimation.

Submitted 1 October, 2020; originally announced October 2020.

Comments: To appear in ICPR 2020

arXiv:2008.06959 [pdf, other]

Image Stylization for Robust Features

Authors: Iaroslav Melekhov, Gabriel J. Brostow, Juho Kannala, Daniyar Turmukhambetov

Abstract: Local features that are robust to both viewpoint and appearance changes are crucial for many computer vision tasks. In this work, we investigate whether photorealistic image stylization improves robustness of local features to not only day-night, but also weather and season variations. We show that image stylization in addition to color augmentation is a powerful method of learning robust features. We evaluate learned features on visual localization benchmarks, outperforming state-of-the-art baseline models despite training without ground-truth 3D correspondences, using synthetic homographies only. We use trained feature networks to compete in the Long-Term Visual Localization and Map-based Localization for Autonomous Driving challenges, achieving competitive scores.

Submitted 16 August, 2020; originally announced August 2020.

Comments: v1.1

arXiv:2008.00715 [pdf, other]

Learning to Drive (L2D) as a Low-Cost Benchmark for Real-World Reinforcement Learning

Authors: Ari Viitala, Rinu Boney, Yi Zhao, Alexander Ilin, Juho Kannala

Abstract: We present Learning to Drive (L2D), a low-cost benchmark for real-world reinforcement learning (RL). L2D involves a simple and reproducible experimental setup where an RL agent has to learn to drive a Donkey car around three miniature tracks, given only monocular image observations and the speed of the car. The agent has to learn to drive from disengagements, which occur when it drives off the track. We present and open-source our training pipeline, which makes it straightforward to apply any existing RL algorithm to the task of autonomous driving with a Donkey car. We test imitation learning, state-of-the-art model-free, and model-based algorithms on the proposed L2D benchmark. Our results show that existing RL algorithms can learn to drive the car from scratch in less than five minutes of interaction. We demonstrate that RL algorithms can learn from sparse and noisy disengagement to drive even faster than imitation learning and a human operator.

Submitted 6 November, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

arXiv:2007.05299 [pdf, other]

Data-Efficient Ranking Distillation for Image Retrieval

Authors: Zakaria Laskar, Juho Kannala

Abstract: Recent advances in deep learning have led to rapid developments in the field of image retrieval. However, the best performing architectures incur significant computational cost. Recent approaches tackle this issue using knowledge distillation to transfer knowledge from a deeper and heavier architecture to a much smaller network. In this paper we address knowledge distillation for metric learning problems. Unlike previous approaches, our proposed method jointly addresses the following constraints: i) limited queries to teacher model, ii) black box teacher model with access to the final output representation, and iii) small fraction of original training data without any ground-truth labels. In addition, the distillation method does not require the student and teacher to have same dimensionality. Addressing these constraints reduces computation requirements, dependency on large-scale training datasets and addresses practical scenarios of limited or partial access to private data such as teacher models or the corresponding training data/labels. The key idea is to augment the original training set with additional samples by performing linear interpolation in the final output representation space. Distillation is then performed in the joint space of original and augmented teacher-student sample representations. Results demonstrate that our approach can match baseline models trained with full supervision. In low training sample settings, our approach outperforms the fully supervised approach on two challenging image retrieval datasets, ROxford5k and RParis6k \cite{Roxf} with the least possible teacher supervision.
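The augmentation step described in this abstract (linear interpolation in the final output representation space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `interpolate` and `augment`, the random pairing of samples, and the uniform mixing coefficient are all assumptions made for illustration.

```python
import random


def interpolate(u, v, lam):
    """Return the convex combination lam*u + (1-lam)*v of two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]


def augment(embeddings, n_new, rng):
    """Create n_new synthetic embeddings by mixing random pairs.

    Hypothetical scheme: pairs are drawn uniformly at random and the
    mixing coefficient is uniform in [0, 1); the paper may use another rule.
    """
    out = []
    for _ in range(n_new):
        u, v = rng.sample(embeddings, 2)
        lam = rng.random()
        out.append(interpolate(u, v, lam))
    return out


# Toy "teacher" output representations (made-up values).
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = augment(teacher, 4, random.Random(0))
midpoint = interpolate(teacher[0], teacher[1], 0.5)
```

Here `midpoint` lies halfway between the first two toy embeddings, and `augmented` holds four synthetic points that stay inside the convex hull of the originals.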

Submitted 13 July, 2020; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: 10 pages, 2 figures. Edited figure 7

arXiv:2006.02158 [pdf, other]

Interpolation-based semi-supervised learning for object detection

Authors: Jisoo Jeong, Vikas Verma, Minsung Hyun, Juho Kannala, Nojun Kwak

Abstract: Despite the data labeling cost for the object detection tasks being substantially more than that of the classification tasks, semi-supervised learning methods for object detection have not been studied much. In this paper, we propose an Interpolation-based Semi-supervised learning method for object Detection (ISD), which considers and solves the problems caused by applying conventional Interpolation Regularization (IR) directly to object detection. We divide the output of the model into two types according to the objectness scores of both original patches that are mixed in IR. Then, we apply a separate loss suitable for each type in an unsupervised manner. The proposed losses dramatically improve the performance of semi-supervised learning as well as supervised learning. In the supervised learning setting, our method improves the baseline methods by a significant margin. In the semi-supervised learning setting, our algorithm improves the performance on a benchmark dataset (PASCAL VOC and MSCOCO) in a benchmark architecture (SSD).

Submitted 29 December, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

arXiv:2005.10298 [pdf, ps, other]

Sensor Networks TDOA Self-Calibration: 2D Complexity Analysis and Solutions

Authors: Luca Ferranti, Kalle Åström, Magnus Oskarsson, Jani Boutellier, Juho Kannala

Abstract: Given a network of receivers and transmitters, the process of determining their positions from measured pseudoranges is known as network self-calibration. In this paper we consider 2D networks with synchronized receivers but unsynchronized transmitters and the corresponding calibration techniques, known as Time-Difference-Of-Arrival (TDOA) techniques. Despite previous work, TDOA self-calibration is computationally challenging. Iterative algorithms are very sensitive to the initialization, causing convergence issues. In this paper, we present a novel approach, which gives an algebraic solution to two previously unsolved scenarios. We also demonstrate that our solvers produce an excellent initial value for non-linear optimisation algorithms, leading to a full pipeline robust to noise.

Submitted 22 October, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

arXiv:1912.10321 [pdf, other]

Deep Automodulators

Authors: Ari Heljakka, Yuxin Hou, Juho Kannala, Arno Solin

Abstract: We introduce a new category of generative autoencoders called automodulators. These networks can faithfully reproduce individual real-world input images like regular autoencoders, but also generate a fused sample from an arbitrary combination of several such images, allowing instantaneous 'style-mixing' and other new applications. An automodulator decouples the data flow of decoder operations from statistical properties thereof and uses the latent vector to modulate the former by the latter, with a principled approach for mutual disentanglement of decoder layers. Prior work has explored similar decoder architecture with GANs, but their focus has been on random sampling. A corresponding autoencoder could operate on real input images. For the first time, we show how to train such a general-purpose model with sharp outputs in high resolution, using novel training techniques, demonstrated on four image data sets. Besides style-mixing, we show state-of-the-art results in autoencoder comparison, and visual image quality nearly indistinguishable from state-of-the-art GANs. We expect the automodulator variants to become a useful building block for image applications and other data domains.

Submitted 29 October, 2020; v1 submitted 21 December, 2019; originally announced December 2019.

Comments: To appear in Advances in Neural Information Processing Systems (NeurIPS 2020)

arXiv:1910.05527 [pdf, other]

Regularizing Model-Based Planning with Energy-Based Models

Authors: Rinu Boney, Juho Kannala, Alexander Ilin

Abstract: Model-based reinforcement learning could enable sample-efficient learning by quickly acquiring rich knowledge about the world and using it to improve behaviour without additional data. Learned dynamics models can be directly used for planning actions but this has been challenging because of inaccuracies in the learned models. In this paper, we focus on planning with learned dynamics models and propose to regularize it using energy estimates of state transitions in the environment. We visually demonstrate the effectiveness of the proposed method and show that off-policy training of an energy estimator can be effectively used to regularize planning with pre-trained dynamics models. Further, we demonstrate that the proposed method enables sample-efficient learning to achieve competitive performance in challenging continuous control tasks such as Half-cheetah and Ant in just a few minutes of experience.

Submitted 12 October, 2019; originally announced October 2019.

Comments: Conference on Robot Learning 2019

arXiv:1909.11715 [pdf, other]

GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Authors: Vikas Verma, Meng Qu, Kenji Kawaguchi, Alex Lamb, Yoshua Bengio, Juho Kannala, Jian Tang

Abstract: We present GraphMix, a regularization method for Graph Neural Network based semi-supervised object classification, whereby we propose to train a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization. Further, we provide a theoretical analysis of how GraphMix improves the generalization bounds of the underlying graph neural network, without making any assumptions about the "aggregation" layer or the depth of the graph neural networks. We experimentally validate this analysis by applying GraphMix to various architectures such as Graph Convolutional Networks, Graph Attention Networks and Graph-U-Net. Despite its simplicity, we demonstrate that GraphMix can consistently improve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.

Submitted 8 October, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: https://github.com/vikasverma1077/GraphMix

arXiv:1909.06216 [pdf, other]

Hierarchical Scene Coordinate Classification and Regression for Visual Localization

Authors: Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, Juho Kannala

Abstract: Visual localization is critical to many applications in computer vision and robotics. To address single-image RGB localization, state-of-the-art feature-based methods match local descriptors between a query image and a pre-built 3D model. Recently, deep neural networks have been exploited to regress the mapping between raw pixels and 3D coordinates in the scene, and thus the matching is implicitly performed by the forward pass through the network. However, in a large and ambiguous environment, learning such a regression task directly can be difficult for a single network. In this work, we present a new hierarchical scene coordinate network to predict pixel scene coordinates in a coarse-to-fine manner from a single RGB image. The network consists of a series of output layers, each of them conditioned on the previous ones. The final output layer predicts the 3D coordinates and the others produce progressively finer discrete location labels. The proposed method outperforms the baseline regression-only network and allows us to train compact models which scale robustly to large environments. It sets a new state-of-the-art for single-image RGB localization performance on the 7-Scenes, 12-Scenes, Cambridge Landmarks datasets, and three combined scenes. Moreover, for large-scale outdoor localization on the Aachen Day-Night dataset, we present a hybrid approach which outperforms existing scene coordinate regression methods, and reduces significantly the performance gap w.r.t. explicit feature matching methods.

Submitted 31 March, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: CVPR 2020

arXiv:1906.06784 [pdf]

doi 10.1016/j.neunet.2022.07.012

Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Too Much Accuracy

Authors: Alex Lamb, Vikas Verma, Kenji Kawaguchi, Alexander Matyasko, Savya Khosla, Juho Kannala, Yoshua Bengio

Abstract: Adversarial robustness has become a central goal in deep learning, both in the theory and the practice. However, successful methods to improve the adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how the adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated adversarial training we retain the adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide mathematical analysis of Interpolated Adversarial Training to confirm its efficiencies and demonstrate its advantages in terms of robustness and generalization.
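The relative-increase figures quoted in this abstract can be recomputed from the three test errors it reports. The sketch below is only a worked check under the assumption that "relative increase" means (new error - baseline error) / baseline error; the helper name `relative_increase` is hypothetical. From the rounded errors the second value comes out at roughly 45.6%, close to the 45.5% quoted, and the small gap is presumably due to rounding in the reported errors.

```python
def relative_increase(baseline, value):
    """Percentage increase of value over baseline."""
    return 100.0 * (value - baseline) / baseline


# Standard test errors on CIFAR-10 as reported in the abstract (percent).
standard_error = 4.43            # no adversarial training
adversarial_error = 12.32        # plain adversarial training
interpolated_error = 6.45        # Interpolated Adversarial Training

adv_increase = relative_increase(standard_error, adversarial_error)
iat_increase = relative_increase(standard_error, interpolated_error)
```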

Submitted 19 October, 2022; v1 submitted 16 June, 2019; originally announced June 2019.

Comments: This is the latest version, which is published in the Journal, "Neural Networks", in 2022. All the previous results are unchanged. First two authors contributed equally

Journal ref: Neural Networks, volume 154, pages 218-233 (2022)

arXiv:1906.00360 [pdf, other]

Iterative Path Reconstruction for Large-Scale Inertial Navigation on Smartphones

Authors: Santiago Cortés Reina, Yuxin Hou, Juho Kannala, Arno Solin

Abstract: Modern smartphones have all the sensing capabilities required for accurate and robust navigation and tracking. In specific environments some data streams may be absent, less reliable, or flat out wrong. In particular, the GNSS signal can become flawed or silent inside buildings or in streets with tall buildings. In this application paper, we aim to advance the current state-of-the-art in motion estimation using inertial measurements in combination with partial GNSS data on standard smartphones. We show how iterative estimation methods help refine the positioning path estimates in retrospective use cases that can cover both fixed-interval and fixed-lag scenarios. We compare estimation results provided by global iterated Kalman filtering methods to those of a visual-inertial tracking scheme (Apple ARKit). The practical applicability is demonstrated on real-world use cases on empirical data acquired from both smartphones and tablet devices.

Submitted 2 June, 2019; originally announced June 2019.

Comments: To appear in Proceedings FUSION 2019

arXiv:1905.10693 [pdf, other]

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction

Authors: Hamed R. Tavakoli, Ali Borji, Esa Rahtu, Juho Kannala

Abstract: This paper studies audio-visual deep saliency prediction. It introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction dubbed "DAVE" in conjunction with our efforts towards building an Audio-Visual Eye-tracking corpus named "AVE". Despite the existence of a strong relation between auditory and visual cues for guiding gaze during perception, video saliency models only consider visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we investigate the applicability of audio cues in conjunction with visual ones in predicting saliency maps using deep neural networks. To this end, the proposed model is intentionally designed to be simple. Two baseline models are developed on the same architecture which consists of an encoder-decoder. The encoder projects the input into a feature space followed by a decoder that infers saliency. We conduct an extensive analysis on different modalities and various aspects of multi-model dynamic saliency prediction. Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) salient visible sound-source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space lead to more powerful predictions even in absence of more sophisticated saliency decoders, and (4) Audio-Visual model improves over 53.54% of the frames predicted by the best Visual model (our baseline). Our endeavour demonstrates that audio is an important cue that boosts dynamic video saliency prediction and helps models to approach human performance. The code is available at https://github.com/hrtavakoli/DAVE

Submitted 7 January, 2020; v1 submitted 25 May, 2019; originally announced May 2019.

arXiv:1904.06882 [pdf, other]

Geometric Image Correspondence Verification by Dense Pixel Matching

Authors: Zakaria Laskar, Iaroslav Melekhov, Hamed R. Tavakoli, Juha Ylioinas, Juho Kannala

Abstract: This paper addresses the problem of determining dense pixel correspondences between two images and its application to geometric correspondence verification in image retrieval. The main contribution is a geometric correspondence verification approach for re-ranking a shortlist of retrieved database images based on their dense pair-wise matching with the query image at a pixel level. We determine a set of cyclically consistent dense pixel matches between the pair of images and evaluate local similarity of matched pixels using neural network based image descriptors. Final re-ranking is based on a novel similarity function, which fuses the local similarity metric with a global similarity metric and a geometric consistency measure computed for the matched pixels. For dense matching our approach utilizes a modified version of a recently proposed dense geometric correspondence network (DGC-Net), which we also improve by optimizing the architecture. The proposed model and similarity metric compare favourably to the state-of-the-art image retrieval methods. In addition, we apply our method to the problem of long-term visual localization demonstrating promising results and generalization across datasets.

Submitted 17 August, 2020; v1 submitted 15 April, 2019; originally announced April 2019.

Comments: The appendix has been updated by adding some clarifications

arXiv:1904.06397 [pdf, other]

Multi-View Stereo by Temporal Nonparametric Fusion

Authors: Yuxin Hou, Juho Kannala, Arno Solin

Abstract: We propose a novel idea for depth estimation from multi-view image-pose pairs, where the model has capability to leverage information from previous latent-space encodings of the scene. This model uses pairs of images and poses, which are passed through an encoder-decoder model for disparity estimation. The novelty lies in soft-constraining the bottleneck layer by a nonparametric Gaussian process prior. We propose a pose-kernel structure that encourages similar poses to have resembling latent spaces. The flexibility of the Gaussian process (GP) prior provides adapting memory for fusing information from previous views. We train the encoder-decoder and the GP hyperparameters jointly end-to-end. In addition to a batch method, we derive a lightweight estimation scheme that circumvents standard pitfalls in scaling Gaussian process inference, and demonstrate how our scheme can run in real-time on smart devices.

Submitted 16 August, 2019; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: ICCV 2019

arXiv:1904.06145 [pdf, other]

Towards Photographic Image Manipulation with Balanced Growing of Generative Autoencoders

Authors: Ari Heljakka, Arno Solin, Juho Kannala

Abstract: We present a generative autoencoder that provides fast encoding, faithful reconstructions (e.g. retaining the identity of a face), sharp generated/reconstructed samples in high resolutions, and a well-structured latent space that supports semantic manipulation of the inputs. There are no current autoencoder or GAN models that satisfactorily achieve all of these. We build on the progressively growing autoencoder model PIONEER, for which we completely alter the training dynamics based on a careful analysis of recently introduced normalization schemes. We show significantly improved visual and quantitative results for face identity conservation in CelebAHQ. Our model achieves state-of-the-art disentanglement of latent space, both quantitatively and via realistic image attribute manipulations. On the LSUN Bedrooms dataset, we improve the disentanglement performance of the vanilla PIONEER, despite having a simpler model. Overall, our results indicate that the PIONEER networks provide a way towards photorealistic face manipulation.

Submitted 20 February, 2020; v1 submitted 12 April, 2019; originally announced April 2019.

Comments: WACV 2020

arXiv:1904.06090 [pdf, other]

doi 10.1109/WACV.2019.00035

Digging Deeper into Egocentric Gaze Prediction

Authors: Hamed R. Tavakoli, Esa Rahtu, Juho Kannala, Ali Borji

Abstract: This paper digs deeper into factors that influence egocentric gaze. Instead of training deep models for this purpose in a blind manner, we propose to inspect factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed versus strong spatial prior baselines. Task-specific cues such as vanishing point, manipulation point, and hand regions are analyzed as representatives of top-down information. We also look into the contribution of these factors by investigating a simple recurrent neural model for ego-centric gaze prediction. First, deep features are extracted for all input video frames. Then, a gated recurrent unit is employed to integrate information over time and to predict the next fixation. We also propose an integrated model that combines the recurrent model with several top-down and bottom-up cues. Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up saliency models perform poorly in predicting gaze and underperform spatial biases, (3) deep features perform better compared to traditional features, (4) as opposed to hand regions, the manipulation point is a strong influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points and, in particular, manipulation point results in the best gaze prediction accuracy over egocentric videos, (6) the knowledge transfer works best for cases where the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction. Our findings suggest that (1) there should be more emphasis on hand-object interaction and (2) the egocentric vision community should consider larger datasets including diverse stimuli and more subjects.

Submitted 12 April, 2019; originally announced April 2019.

Comments: presented at WACV 2019

arXiv:1904.01920 [pdf, other]

CubiCasa5K: A Dataset and an Improved Multi-Task Model for Floorplan Image Analysis

Authors: Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, Juho Kannala

Abstract: Better understanding and modelling of building interiors and the emergence of more impressive AR/VR technology have brought up the need for automatic parsing of floorplan images. However, there is a clear lack of representative datasets to investigate the problem further. To address this shortcoming, this paper presents a novel image dataset called CubiCasa5K, a large-scale floorplan image dataset containing 5000 samples annotated into over 80 floorplan object categories. The dataset annotations are performed in a dense and versatile manner by using polygons for separating the different objects. Diverging from the classical approaches based on strong heuristics and low-level pixel operations, we present a method relying on an improved multi-task convolutional neural network. By releasing the novel dataset and our implementations, this study significantly boosts the research on automatic floorplan image analysis as it provides a richer set of tools for investigating the problem in a more comprehensive manner.

Submitted 3 April, 2019; originally announced April 2019.

Showing 1–50 of 85 results for author: Kannala, J