-
Towards Photographic Image Manipulation with Balanced Growing of Generative Autoencoders
Authors:
Ari Heljakka,
Arno Solin,
Juho Kannala
Abstract:
We present a generative autoencoder that provides fast encoding, faithful reconstructions (eg. retaining the identity of a face), sharp generated/reconstructed samples in high resolutions, and a well-structured latent space that supports semantic manipulation of the inputs. There are no current autoencoder or GAN models that satisfactorily achieve all of these. We build on the progressively growin…
▽ More
We present a generative autoencoder that provides fast encoding, faithful reconstructions (eg. retaining the identity of a face), sharp generated/reconstructed samples in high resolutions, and a well-structured latent space that supports semantic manipulation of the inputs. There are no current autoencoder or GAN models that satisfactorily achieve all of these. We build on the progressively growing autoencoder model PIONEER, for which we completely alter the training dynamics based on a careful analysis of recently introduced normalization schemes. We show significantly improved visual and quantitative results for face identity conservation in CelebAHQ. Our model achieves state-of-the-art disentanglement of latent space, both quantitatively and via realistic image attribute manipulations. On the LSUN Bedrooms dataset, we improve the disentanglement performance of the vanilla PIONEER, despite having a simpler model. Overall, our results indicate that the PIONEER networks provide a way towards photorealistic face manipulation.
△ Less
Submitted 20 February, 2020; v1 submitted 12 April, 2019;
originally announced April 2019.
-
Digging Deeper into Egocentric Gaze Prediction
Authors:
Hamed R. Tavakoli,
Esa Rahtu,
Juho Kannala,
Ali Borji
Abstract:
This paper digs deeper into factors that influence egocentric gaze. Instead of training deep models for this purpose in a blind manner, we propose to inspect factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed versus strong spatial prior baselines. Task-specific cues such as vanishing point, manipulation point, and hand regions are analyzed…
▽ More
This paper digs deeper into factors that influence egocentric gaze. Instead of training deep models for this purpose in a blind manner, we propose to inspect factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed versus strong spatial prior baselines. Task-specific cues such as vanishing point, manipulation point, and hand regions are analyzed as representatives of top-down information. We also look into the contribution of these factors by investigating a simple recurrent neural model for ego-centric gaze prediction. First, deep features are extracted for all input video frames. Then, a gated recurrent unit is employed to integrate information over time and to predict the next fixation. We also propose an integrated model that combines the recurrent model with several top-down and bottom-up cues. Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up saliency models perform poorly in predicting gaze and underperform spatial biases, (3) deep features perform better compared to traditional features, (4) as opposed to hand regions, the manipulation point is a strong influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points and, in particular, manipulation point results in the best gaze prediction accuracy over egocentric videos, (6) the knowledge transfer works best for cases where the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction. Our findings suggest that (1) there should be more emphasis on hand-object interaction and (2) the egocentric vision community should consider larger datasets including diverse stimuli and more subjects.
△ Less
Submitted 12 April, 2019;
originally announced April 2019.
-
CubiCasa5K: A Dataset and an Improved Multi-Task Model for Floorplan Image Analysis
Authors:
Ahti Kalervo,
Juha Ylioinas,
Markus Häikiö,
Antti Karhu,
Juho Kannala
Abstract:
Better understanding and modelling of building interiors and the emergence of more impressive AR/VR technology has brought up the need for automatic parsing of floorplan images. However, there is a clear lack of representative datasets to investigate the problem further. To address this shortcoming, this paper presents a novel image dataset called CubiCasa5K, a large-scale floorplan image dataset…
▽ More
Better understanding and modelling of building interiors and the emergence of more impressive AR/VR technology has brought up the need for automatic parsing of floorplan images. However, there is a clear lack of representative datasets to investigate the problem further. To address this shortcoming, this paper presents a novel image dataset called CubiCasa5K, a large-scale floorplan image dataset containing 5000 samples annotated into over 80 floorplan object categories. The dataset annotations are performed in a dense and versatile manner by using polygons for separating the different objects. Diverging from the classical approaches based on strong heuristics and low-level pixel operations, we present a method relying on an improved multi-task convolutional neural network. By releasing the novel dataset and our implementations, this study significantly boosts the research on automatic floorplan image analysis as it provides a richer set of tools for investigating the problem in a more comprehensive manner.
△ Less
Submitted 3 April, 2019;
originally announced April 2019.
-
ICface: Interpretable and Controllable Face Reenactment Using GANs
Authors:
Soumya Tripathy,
Juho Kannala,
Esa Rahtu
Abstract:
This paper presents a generic face animator that is able to control the pose and expressions of a given face image. The animation is driven by human interpretable control signals consisting of head pose angles and the Action Unit (AU) values. The control information can be obtained from multiple sources including external driving videos and manual controls. Due to the interpretable nature of the d…
▽ More
This paper presents a generic face animator that is able to control the pose and expressions of a given face image. The animation is driven by human interpretable control signals consisting of head pose angles and the Action Unit (AU) values. The control information can be obtained from multiple sources including external driving videos and manual controls. Due to the interpretable nature of the driving signal, one can easily mix the information between multiple sources (e.g. pose from one image and expression from another) and apply selective post-production editing. The proposed face animator is implemented as a two-stage neural network model that is learned in a self-supervised manner using a large video collection. The proposed Interpretable and Controllable face reenactment network (ICface) is compared to the state-of-the-art neural network-based face animation techniques in multiple tasks. The results indicate that ICface produces better visual quality while being more versatile than most of the comparison methods. The introduced model could provide a lightweight and easy to use tool for a multitude of advanced image and video editing tasks.
△ Less
Submitted 17 January, 2020; v1 submitted 3 April, 2019;
originally announced April 2019.
-
Regularizing Trajectory Optimization with Denoising Autoencoders
Authors:
Rinu Boney,
Norman Di Palo,
Mathias Berglund,
Alexander Ilin,
Juho Kannala,
Antti Rasmus,
Harri Valpola
Abstract:
Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed reg…
▽ More
Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.
△ Less
Submitted 25 December, 2019; v1 submitted 28 March, 2019;
originally announced March 2019.
-
Interpolation Consistency Training for Semi-Supervised Learning
Authors:
Vikas Verma,
Kenji Kawaguchi,
Alex Lamb,
Juho Kannala,
Arno Solin,
Yoshua Bengio,
David Lopez-Paz
Abstract:
We introduce Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density reg…
▽ More
We introduce Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points which reduces overfitting to labeled points under high confidence values.
△ Less
Submitted 19 October, 2022; v1 submitted 9 March, 2019;
originally announced March 2019.
-
Unstructured Multi-View Depth Estimation Using Mask-Based Multiplane Representation
Authors:
Yuxin Hou,
Arno Solin,
Juho Kannala
Abstract:
This paper presents a novel method, MaskMVS, to solve depth estimation for unstructured multi-view image-pose pairs. In the plane-sweep procedure, the depth planes are sampled by histogram matching that ensures covering the depth range of interest. Unlike other plane-sweep methods, we do not rely on a cost metric to explicitly build the cost volume, but instead infer a multiplane mask representati…
▽ More
This paper presents a novel method, MaskMVS, to solve depth estimation for unstructured multi-view image-pose pairs. In the plane-sweep procedure, the depth planes are sampled by histogram matching that ensures covering the depth range of interest. Unlike other plane-sweep methods, we do not rely on a cost metric to explicitly build the cost volume, but instead infer a multiplane mask representation which regularizes the learning. Compared to many previous approaches, we show that our method is lightweight and generalizes well without requiring excessive training. We outperform the current state-of-the-art and show results on the sun3d, scenes11, MVS, and RGBD test data sets.
△ Less
Submitted 10 April, 2019; v1 submitted 6 February, 2019;
originally announced February 2019.
-
Mask-RCNN and U-net Ensembled for Nuclei Segmentation
Authors:
Aarno Oskar Vuola,
Saad Ullah Akram,
Juho Kannala
Abstract:
Nuclei segmentation is both an important and in some ways ideal task for modern computer vision methods, e.g. convolutional neural networks. While recent developments in theory and open-source software have made these tools easier to implement, expert knowledge is still required to choose the right model architecture and training setup. We compare two popular segmentation frameworks, U-Net and Mas…
▽ More
Nuclei segmentation is both an important and in some ways ideal task for modern computer vision methods, e.g. convolutional neural networks. While recent developments in theory and open-source software have made these tools easier to implement, expert knowledge is still required to choose the right model architecture and training setup. We compare two popular segmentation frameworks, U-Net and Mask-RCNN in the nuclei segmentation task and find that they have different strengths and failures. To get the best of both worlds, we develop an ensemble model to combine their predictions that can outperform both models by a significant margin and should be considered when aiming for best nuclei segmentation performance.
△ Less
Submitted 29 January, 2019;
originally announced January 2019.
-
Semantic Matching by Weakly Supervised 2D Point Set Registration
Authors:
Zakaria Laskar,
Hamed R. Tavakoli,
Juho Kannala
Abstract:
In this paper we address the problem of establishing correspondences between different instances of the same object. The problem is posed as finding the geometric transformation that aligns a given image pair. We use a convolutional neural network (CNN) to directly regress the parameters of the transformation model. The alignment problem is defined in the setting where an unordered set of semantic…
▽ More
In this paper we address the problem of establishing correspondences between different instances of the same object. The problem is posed as finding the geometric transformation that aligns a given image pair. We use a convolutional neural network (CNN) to directly regress the parameters of the transformation model. The alignment problem is defined in the setting where an unordered set of semantic key-points per image are available, but, without the correspondence information. To this end we propose a novel loss function based on cyclic consistency that solves this 2D point set registration problem by inferring the optimal geometric transformation model parameters. We train and test our approach on a standard benchmark dataset Proposal-Flow (PF-PASCAL)\cite{proposal_flow}. The proposed approach achieves state-of-the-art results demonstrating the effectiveness of the method. In addition, we show our approach further benefits from additional training samples in PF-PASCAL generated by using category level information.
△ Less
Submitted 24 January, 2019;
originally announced January 2019.
-
Semi-Supervised Semantic Matching
Authors:
Zakaria Laskar,
Juho Kannala
Abstract:
Convolutional neural networks (CNNs) have been successfully applied to solve the problem of correspondence estimation between semantically related images. Due to non-availability of large training datasets, existing methods resort to self-supervised or unsupervised training paradigm. In this paper we propose a semi-supervised learning framework that imposes cyclic consistency constraint on unlabel…
▽ More
Convolutional neural networks (CNNs) have been successfully applied to solve the problem of correspondence estimation between semantically related images. Due to non-availability of large training datasets, existing methods resort to self-supervised or unsupervised training paradigm. In this paper we propose a semi-supervised learning framework that imposes cyclic consistency constraint on unlabeled image pairs. Together with the supervised loss the proposed model achieves state-of-the-art on a benchmark semantic matching dataset.
△ Less
Submitted 24 January, 2019;
originally announced January 2019.
-
LSD$_2$ -- Joint Denoising and Deblurring of Short and Long Exposure Images with CNNs
Authors:
Janne Mustaniemi,
Juho Kannala,
Jiri Matas,
Simo Särkkä,
Janne Heikkilä
Abstract:
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions. We propose an approach based on capturing pairs of short and long exposure images in rapid succession and fusing them into a single high-quality photograph. Unlike existing methods, we take advantage of both images simultaneously and perform a joint denoising and…
▽ More
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions. We propose an approach based on capturing pairs of short and long exposure images in rapid succession and fusing them into a single high-quality photograph. Unlike existing methods, we take advantage of both images simultaneously and perform a joint denoising and deblurring using a convolutional neural network. A novel approach is introduced to generate realistic short-long exposure image pairs. The method produces good images in extremely challenging conditions and outperforms existing denoising and deblurring methods. It also enables exposure fusion in the presence of motion blur.
△ Less
Submitted 1 September, 2020; v1 submitted 23 November, 2018;
originally announced November 2018.
-
DGC-Net: Dense Geometric Correspondence Network
Authors:
Iaroslav Melekhov,
Aleksei Tiulpin,
Torsten Sattler,
Marc Pollefeys,
Esa Rahtu,
Juho Kannala
Abstract:
This paper addresses the challenge of dense pixel correspondence estimation between two images. This problem is closely related to optical flow estimation task where ConvNets (CNNs) have recently achieved significant progress. While optical flow methods produce very accurate results for the small pixel translation and limited appearance variation scenarios, they hardly deal with the strong geometr…
▽ More
This paper addresses the challenge of dense pixel correspondence estimation between two images. This problem is closely related to optical flow estimation task where ConvNets (CNNs) have recently achieved significant progress. While optical flow methods produce very accurate results for the small pixel translation and limited appearance variation scenarios, they hardly deal with the strong geometric transformations that we consider in this work. In this paper, we propose a coarse-to-fine CNN-based framework that can leverage the advantages of optical flow approaches and extend them to the case of large transformations providing dense and subpixel accurate estimates. It is trained on synthetic transformations and demonstrates very good performance to unseen, realistic, data. Further, we apply our method to the problem of relative camera pose estimation and demonstrate that the model outperforms existing dense approaches.
△ Less
Submitted 22 October, 2018; v1 submitted 19 October, 2018;
originally announced October 2018.
-
Gyroscope-Aided Motion Deblurring with Deep Networks
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic trainin…
▽ More
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state-of-the-art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against the motion blur.
△ Less
Submitted 23 November, 2018; v1 submitted 1 October, 2018;
originally announced October 2018.
-
Scene Coordinate Regression with Angle-Based Reprojection Loss for Camera Relocalization
Authors:
Xiaotian Li,
Juha Ylioinas,
Jakob Verbeek,
Juho Kannala
Abstract:
Image-based camera relocalization is an important problem in computer vision and robotics. Recent works utilize convolutional neural networks (CNNs) to regress for pixels in a query image their corresponding 3D world coordinates in the scene. The final pose is then solved via a RANSAC-based optimization scheme using the predicted coordinates. Usually, the CNN is trained with ground truth scene coo…
▽ More
Image-based camera relocalization is an important problem in computer vision and robotics. Recent works utilize convolutional neural networks (CNNs) to regress for pixels in a query image their corresponding 3D world coordinates in the scene. The final pose is then solved via a RANSAC-based optimization scheme using the predicted coordinates. Usually, the CNN is trained with ground truth scene coordinates, but it has also been shown that the network can discover 3D scene geometry automatically by minimizing single-view reprojection loss. However, due to the deficiencies of the reprojection loss, the network needs to be carefully initialized. In this paper, we present a new angle-based reprojection loss, which resolves the issues of the original reprojection loss. With this new loss function, the network can be trained without careful initialization, and the system achieves more accurate results. The new loss also enables us to utilize available multi-view constraints, which further improve performance.
△ Less
Submitted 30 September, 2018; v1 submitted 15 August, 2018;
originally announced August 2018.
-
Deep Learning Based Speed Estimation for Constraining Strapdown Inertial Navigation on Smartphones
Authors:
Santiago Cortés,
Arno Solin,
Juho Kannala
Abstract:
Strapdown inertial navigation systems are sensitive to the quality of the data provided by the accelerometer and gyroscope. Low-grade IMUs in handheld smart-devices pose a problem for inertial odometry on these devices. We propose a scheme for constraining the inertial odometry problem by complementing non-linear state estimation by a CNN-based deep-learning model for inferring the momentary speed…
▽ More
Strapdown inertial navigation systems are sensitive to the quality of the data provided by the accelerometer and gyroscope. Low-grade IMUs in handheld smart-devices pose a problem for inertial odometry on these devices. We propose a scheme for constraining the inertial odometry problem by complementing non-linear state estimation by a CNN-based deep-learning model for inferring the momentary speed based on a window of IMU samples. We show the feasibility of the model using a wide range of data from an iPhone, and present proof-of-concept results for how the model can be combined with an inertial navigation system for three-dimensional inertial navigation.
△ Less
Submitted 10 August, 2018;
originally announced August 2018.
-
Leveraging Unlabeled Whole-Slide-Images for Mitosis Detection
Authors:
Saad Ullah Akram,
Talha Qaiser,
Simon Graham,
Juho Kannala,
Janne Heikkilä,
Nasir Rajpoot
Abstract:
Mitosis count is an important biomarker for prognosis of various cancers. At present, pathologists typically perform manual counting on a few selected regions of interest in breast whole-slide-images (WSIs) of patient biopsies. This task is very time-consuming, tedious and subjective. Automated mitosis detection methods have made great advances in recent years. However, these methods require exhau…
▽ More
Mitosis count is an important biomarker for prognosis of various cancers. At present, pathologists typically perform manual counting on a few selected regions of interest in breast whole-slide-images (WSIs) of patient biopsies. This task is very time-consuming, tedious and subjective. Automated mitosis detection methods have made great advances in recent years. However, these methods require exhaustive labeling of a large number of selected regions of interest. This task is very expensive because expert pathologists are needed for reliable and accurate annotations. In this paper, we present a semi-supervised mitosis detection method which is designed to leverage a large number of unlabeled breast cancer WSIs. As a result, our method capitalizes on the growing number of digitized histology images, without relying on exhaustive annotations, subsequently improving mitosis detection. Our method first learns a mitosis detector from labeled data, uses this detector to mine additional mitosis samples from unlabeled WSIs, and then trains the final model using this larger and diverse set of mitosis samples. The use of unlabeled data improves F1-score by $\sim$5\% compared to our best performing fully-supervised model on the TUPAC validation set. Our submission (single model) to TUPAC challenge ranks highly on the leaderboard with an F1-score of 0.64.
△ Less
Submitted 31 July, 2018;
originally announced July 2018.
-
ADVIO: An authentic dataset for visual-inertial odometry
Authors:
Santiago Cortés,
Arno Solin,
Esa Rahtu,
Juho Kannala
Abstract:
The lack of realistic and open benchmarking datasets for pedestrian visual-inertial odometry has made it hard to pinpoint differences in published methods. Existing datasets either lack a full six degree-of-freedom ground-truth or are limited to small spaces with optical tracking systems. We take advantage of advances in pure inertial navigation, and develop a set of versatile and challenging real…
▽ More
The lack of realistic and open benchmarking datasets for pedestrian visual-inertial odometry has made it hard to pinpoint differences in published methods. Existing datasets either lack a full six degree-of-freedom ground-truth or are limited to small spaces with optical tracking systems. We take advantage of advances in pure inertial navigation, and develop a set of versatile and challenging real-world computer vision benchmark sets for visual-inertial odometry. For this purpose, we have built a test rig equipped with an iPhone, a Google Pixel Android phone, and a Google Tango device. We provide a wide range of raw sensor data that is accessible on almost any modern-day smartphone together with a high-quality ground-truth track. We also compare resulting visual-inertial tracks from Google Tango, ARCore, and Apple ARKit with two recent methods published in academic forums. The data sets cover both indoor and outdoor cases, with stairs, escalators, elevators, office environments, a shop** mall, and metro station.
△ Less
Submitted 25 July, 2018;
originally announced July 2018.
-
Learning Representations for Soft Skill Matching
Authors:
Luiza Sayfullina,
Eric Malmi,
Juho Kannala
Abstract:
Employers actively look for talents having not only specific hard skills but also various soft skills. To analyze the soft skill demands on the job market, it is important to be able to detect soft skill phrases from job advertisements automatically. However, a naive matching of soft skill phrases can lead to false positive matches when a soft skill phrase, such as friendly, is used to describe a…
▽ More
Employers actively look for talents having not only specific hard skills but also various soft skills. To analyze the soft skill demands on the job market, it is important to be able to detect soft skill phrases from job advertisements automatically. However, a naive matching of soft skill phrases can lead to false positive matches when a soft skill phrase, such as friendly, is used to describe a company, a team, or another entity, rather than a desired candidate.
In this paper, we propose a phrase-matching-based approach which differentiates between soft skill phrases referring to a candidate vs. something else. The disambiguation is formulated as a binary text classification problem where the prediction is made for the potential soft skill based on the context where it occurs. To inform the model about the soft skill for which the prediction is made, we develop several approaches, including soft skill masking and soft skill tagging.
We compare several neural network based approaches, including CNN, LSTM and Hierarchical Attention Model. The proposed tagging-based input representation using LSTM achieved the highest recall of 83.92% on the job dataset when fixing a precision to 95%.
△ Less
Submitted 20 July, 2018;
originally announced July 2018.
-
Pioneer Networks: Progressively Growing Generative Autoencoder
Authors:
Ari Heljakka,
Arno Solin,
Juho Kannala
Abstract:
We introduce a novel generative autoencoder network model that learns to encode and reconstruct images with high quality and resolution, and supports smooth random sampling from the latent space of the encoder. Generative adversarial networks (GANs) are known for their ability to simulate random high-quality images, but they cannot reconstruct existing images. Previous works have attempted to exte…
▽ More
We introduce a novel generative autoencoder network model that learns to encode and reconstruct images with high quality and resolution, and supports smooth random sampling from the latent space of the encoder. Generative adversarial networks (GANs) are known for their ability to simulate random high-quality images, but they cannot reconstruct existing images. Previous works have attempted to extend GANs to support such inference but, so far, have not delivered satisfactory high-quality results. Instead, we propose the Progressively Growing Generative Autoencoder (PIONEER) network which achieves high-quality reconstruction with $128{\times}128$ images without requiring a GAN discriminator. We merge recent techniques for progressively building up the parts of the network with the recently introduced adversarial encoder-generator network. The ability to reconstruct input images is crucial in many real-world applications, and allows for precise intelligent manipulation of existing images. We show promising results in image synthesis and inference, with state-of-the-art results in CelebA inference tasks.
△ Less
Submitted 9 October, 2018; v1 submitted 9 July, 2018;
originally announced July 2018.
-
Robust Gyroscope-Aided Camera Self-Calibration
Authors:
Santiago Cortés Reina,
Arno Solin,
Juho Kannala
Abstract:
Camera calibration for estimating the intrinsic parameters and lens distortion is a prerequisite for various monocular vision applications including feature tracking and video stabilization. This application paper proposes a model for estimating the parameters on the fly by fusing gyroscope and camera data, both readily available in modern day smartphones. The model is based on joint estimation of…
▽ More
Camera calibration for estimating the intrinsic parameters and lens distortion is a prerequisite for various monocular vision applications including feature tracking and video stabilization. This application paper proposes a model for estimating the parameters on the fly by fusing gyroscope and camera data, both readily available in modern day smartphones. The model is based on joint estimation of visual feature positions, camera parameters, and the camera pose, the movement of which is assumed to follow the movement predicted by the gyroscope. Our model assumes the camera movement to be free, but continuous and differentiable, and individual features are assumed to stay stationary. The estimation is performed online using an extended Kalman filter, and it is shown to outperform existing methods in robustness and insensitivity to initialization. We demonstrate the method using simulated data and empirical data from an iPad.
△ Less
Submitted 31 May, 2018;
originally announced May 2018.
-
Fast Motion Deblurring for Feature Detection and Matching Using Inertial Measurements
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
Many computer vision and image processing applications rely on local features. It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against the motion blur. Unlike most deblurring algorithms, the method can handle spatial…
▽ More
Many computer vision and image processing applications rely on local features. It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against the motion blur. Unlike most deblurring algorithms, the method can handle spatially-variant blur and rolling shutter distortion. Furthermore, it is capable of running in real-time contrary to state-of-the-art algorithms. The limitations of inertial-based blur estimation are taken into account by validating the blur estimates using image data. The evaluation shows that when the method is used with traditional feature detector and descriptor, it increases the number of detected keypoints, provides higher repeatability and improves the localization accuracy. We also demonstrate that such features will lead to more accurate and complete reconstructions when used in the application of 3D visual reconstruction.
△ Less
Submitted 22 May, 2018;
originally announced May 2018.
-
Learning image-to-image translation using paired and unpaired training samples
Authors:
Soumya Tripathy,
Juho Kannala,
Esa Rahtu
Abstract:
Image-to-image translation is a general name for a task where an image from one domain is converted to a corresponding image in another domain, given sufficient training data. Traditionally different approaches have been proposed depending on whether aligned image pairs or two sets of (unaligned) examples from both domains are available for training. While paired training samples might be difficul…
▽ More
Image-to-image translation is a general name for a task where an image from one domain is converted to a corresponding image in another domain, given sufficient training data. Traditionally different approaches have been proposed depending on whether aligned image pairs or two sets of (unaligned) examples from both domains are available for training. While paired training samples might be difficult to obtain, the unpaired setup leads to a highly under-constrained problem and inferior results. In this paper, we propose a new general purpose image-to-image translation model that is able to utilize both paired and unpaired training data simultaneously. We compare our method with two strong baselines and obtain both qualitatively and quantitatively improved results. Our model outperforms the baselines also in the case of purely paired and unpaired training data. To our knowledge, this is the first work to consider such hybrid setup in image-to-image translation.
△ Less
Submitted 8 May, 2018;
originally announced May 2018.
-
Accurate 3-D Reconstruction with RGB-D Cameras using Depth Map Fusion and Pose Refinement
Authors:
Markus Ylimäki,
Juho Kannala,
Janne Heikkilä
Abstract:
Depth map fusion is an essential part in both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with a passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers, and especially, even significant registration…
▽ More
Depth map fusion is an essential part in both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with a passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers, and especially, even significant registration errors. The proposed method first fuses a sequence of depth maps into a single non-redundant point cloud so that the redundant points are merged together by giving more weight to more certain measurements. Then, the original depth maps are re-registered to the fused point cloud to refine the original camera extrinsic parameters. The fusion is then performed again with the refined extrinsic parameters. This procedure is repeated until the result is satisfying or no significant changes happen between iterations. The method is robust to outliers and erroneous depth measurements as well as even significant depth map registration errors due to inaccurate initial camera poses.
△ Less
Submitted 24 April, 2018;
originally announced April 2018.
-
Devon: Deformable Volume Network for Learning Optical Flow
Authors:
Yao Lu,
Jack Valmadre,
Heng Wang,
Juho Kannala,
Mehrtash Harandi,
Philip H. S. Torr
Abstract:
State-of-the-art neural network models estimate large displacement optical flow in multi-resolution and use war** to propagate the estimation between two resolutions. Despite their impressive results, it is known that there are two problems with the approach. First, the multi-resolution estimation of optical flow fails in situations where small objects move fast. Second, war** creates artifact…
▽ More
State-of-the-art neural network models estimate large displacement optical flow in multi-resolution and use war** to propagate the estimation between two resolutions. Despite their impressive results, it is known that there are two problems with the approach. First, the multi-resolution estimation of optical flow fails in situations where small objects move fast. Second, war** creates artifacts when occlusion or dis-occlusion happens. In this paper, we propose a new neural network module, Deformable Cost Volume, which alleviates the two problems. Based on this module, we designed the Deformable Volume Network (Devon) which can estimate multi-scale optical flow in a single high resolution. Experiments show Devon is more suitable in handling small objects moving fast and achieves comparable results to the state-of-the-art methods in public benchmarks.
△ Less
Submitted 4 March, 2019; v1 submitted 20 February, 2018;
originally announced February 2018.
-
Recursive Chaining of Reversible Image-to-image Translators For Face Aging
Authors:
Ari Heljakka,
Arno Solin,
Juho Kannala
Abstract:
This paper addresses the modeling and simulation of progressive changes over time, such as human face aging. By treating the age phases as a sequence of image domains, we construct a chain of transformers that map images from one age domain to the next. Leveraging recent adversarial image translation methods, our approach requires no training samples of the same individual at different ages. Here,…
▽ More
This paper addresses the modeling and simulation of progressive changes over time, such as human face aging. By treating the age phases as a sequence of image domains, we construct a chain of transformers that map images from one age domain to the next. Leveraging recent adversarial image translation methods, our approach requires no training samples of the same individual at different ages. Here, the model must be flexible enough to translate a child face to a young adult, and all the way through the adulthood to old age. We find that some transformers in the chain can be recursively applied on their own output to cover multiple phases, compressing the chain. The structure of the chain also unearths information about the underlying physical process. We demonstrate the performance of our method with precise and intuitive metrics, and visually match with the face aging state-of-the-art.
△ Less
Submitted 6 August, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Full-Frame Scene Coordinate Regression for Image-Based Localization
Authors:
Xiaotian Li,
Juha Ylioinas,
Juho Kannala
Abstract:
Image-based localization, or camera relocalization, is a fundamental problem in computer vision and robotics, and it refers to estimating camera pose from an image. Recent state-of-the-art approaches use learning based methods, such as Random Forests (RFs) and Convolutional Neural Networks (CNNs), to regress for each pixel in the image its corresponding position in the scene's world coordinate fra…
▽ More
Image-based localization, or camera relocalization, is a fundamental problem in computer vision and robotics, and it refers to estimating camera pose from an image. Recent state-of-the-art approaches use learning based methods, such as Random Forests (RFs) and Convolutional Neural Networks (CNNs), to regress for each pixel in the image its corresponding position in the scene's world coordinate frame, and solve the final pose via a RANSAC-based optimization scheme using the predicted correspondences. In this paper, instead of in a patch-based manner, we propose to perform the scene coordinate regression in a full-frame manner to make the computation efficient at test time and, more importantly, to add more global context to the regression process to improve the robustness. To do so, we adopt a fully convolutional encoder-decoder neural network architecture which accepts a whole image as input and produces scene coordinate predictions for all pixels in the image. However, using more global context is prone to overfitting. To alleviate this issue, we propose to use data augmentation to generate more data for training. In addition to the data augmentation in 2D image space, we also augment the data in 3D space. We evaluate our approach on the publicly available 7-Scenes dataset, and experiments show that it has better scene coordinate predictions and achieves state-of-the-art results in localization with improved robustness on the hardest frames (e.g., frames with repeated structures).
△ Less
Submitted 25 June, 2018; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Image Patch Matching Using Convolutional Descriptors with Euclidean Distance
Authors:
Iaroslav Melekhov,
Juho Kannala,
Esa Rahtu
Abstract:
In this work we propose a neural network based image descriptor suitable for image patch matching, which is an important task in many computer vision applications. Our approach is influenced by recent success of deep convolutional neural networks (CNNs) in object detection and classification tasks. We develop a model which maps the raw input patch to a low dimensional feature vector so that the di…
▽ More
In this work we propose a neural network based image descriptor suitable for image patch matching, which is an important task in many computer vision applications. Our approach is influenced by recent success of deep convolutional neural networks (CNNs) in object detection and classification tasks. We develop a model which maps the raw input patch to a low dimensional feature vector so that the distance between representations is small for similar patches and large otherwise. As a distance metric we utilize L2 norm, i.e. Euclidean distance, which is fast to evaluate and used in most popular hand-crafted descriptors, such as SIFT. According to the results, our approach outperforms state-of-the-art L2-based descriptors and can be considered as a direct replacement of SIFT. In addition, we conducted experiments with batch normalization and histogram equalization as a preprocessing method of the input data. The results confirm that these techniques further improve the performance of the proposed descriptor. Finally, we show promising preliminary results by appending our CNNs with recently proposed spatial transformer networks and provide a visualisation and interpretation of their impact.
△ Less
Submitted 31 October, 2017;
originally announced October 2017.
-
PIVO: Probabilistic Inertial-Visual Odometry for Occlusion-Robust Navigation
Authors:
Arno Solin,
Santiago Cortes,
Esa Rahtu,
Juho Kannala
Abstract:
This paper presents a novel method for visual-inertial odometry. The method is based on an information fusion framework employing low-cost IMU sensors and the monocular camera in a standard smartphone. We formulate a sequential inference scheme, where the IMU drives the dynamical model and the camera frames are used in coupling trailing sequences of augmented poses. The novelty in the model is in…
▽ More
This paper presents a novel method for visual-inertial odometry. The method is based on an information fusion framework employing low-cost IMU sensors and the monocular camera in a standard smartphone. We formulate a sequential inference scheme, where the IMU drives the dynamical model and the camera frames are used in coupling trailing sequences of augmented poses. The novelty in the model is in taking into account all the cross-terms in the updates, thus propagating the inter-connected uncertainties throughout the model. Stronger coupling between the inertial and visual data sources leads to robustness against occlusion and feature-poor environments. We demonstrate results on data collected with an iPhone and provide comparisons against the Tango device and using the EuRoC data set.
△ Less
Submitted 23 January, 2018; v1 submitted 2 August, 2017;
originally announced August 2017.
-
Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network
Authors:
Zakaria Laskar,
Iaroslav Melekhov,
Surya Kalia,
Juho Kannala
Abstract:
We propose a new deep learning based approach for camera relocalization. Our approach localizes a given query image by using a convolutional neural network (CNN) for first retrieving similar database images and then predicting the relative pose between the query and the database images, whose poses are known. The camera location for the query image is obtained via triangulation from two relative t…
▽ More
We propose a new deep learning based approach for camera relocalization. Our approach localizes a given query image by using a convolutional neural network (CNN) for first retrieving similar database images and then predicting the relative pose between the query and the database images, whose poses are known. The camera location for the query image is obtained via triangulation from two relative translation estimates using a RANSAC based approach. Each relative pose estimate provides a hypothesis for the camera orientation and they are fused in a second RANSAC scheme. The neural network is trained for relative pose estimation in an end-to-end manner using training image pairs. In contrast to previous work, our approach does not require scene-specific training of the network, which improves scalability, and it can also be applied to scenes which are not available during the training of the network. As another main contribution, we release a challenging indoor localisation dataset covering 5 different scenes registered to a common coordinate frame. We evaluate our approach using both our own dataset and the standard 7 Scenes benchmark. The results show that the proposed approach generalizes well to previously unseen scenes and compares favourably to other recent CNN-based methods.
△ Less
Submitted 1 August, 2017; v1 submitted 31 July, 2017;
originally announced July 2017.
-
Learning Image Relations with Contrast Association Networks
Authors:
Yao Lu,
Zhirong Yang,
Juho Kannala,
Samuel Kaski
Abstract:
Inferring the relations between two images is an important class of tasks in computer vision. Examples of such tasks include computing optical flow and stereo disparity. We treat the relation inference tasks as a machine learning problem and tackle it with neural networks. A key to the problem is learning a representation of relations. We propose a new neural network module, contrast association u…
▽ More
Inferring the relations between two images is an important class of tasks in computer vision. Examples of such tasks include computing optical flow and stereo disparity. We treat the relation inference tasks as a machine learning problem and tackle it with neural networks. A key to the problem is learning a representation of relations. We propose a new neural network module, contrast association unit (CAU), which explicitly models the relations between two sets of input variables. Due to the non-negativity of the weights in CAU, we adopt a multiplicative update algorithm for learning these weights. Experiments show that neural networks with CAUs are more effective in learning five fundamental image transformations than conventional neural networks.
△ Less
Submitted 11 March, 2019; v1 submitted 16 May, 2017;
originally announced May 2017.
-
Cell Tracking via Proposal Generation and Selection
Authors:
Saad Ullah Akram,
Juho Kannala,
Lauri Eklund,
Janne Heikkilä
Abstract:
Microscopy imaging plays a vital role in understanding many biological processes in development and disease. The recent advances in automation of microscopes and development of methods and markers for live cell imaging has led to rapid growth in the amount of image data being captured. To efficiently and reliably extract useful insights from these captured sequences, automated cell tracking is ess…
▽ More
Microscopy imaging plays a vital role in understanding many biological processes in development and disease. The recent advances in automation of microscopes and development of methods and markers for live cell imaging has led to rapid growth in the amount of image data being captured. To efficiently and reliably extract useful insights from these captured sequences, automated cell tracking is essential. This is a challenging problem due to large variation in the appearance and shapes of cells depending on many factors including imaging methodology, biological characteristics of cells, cell matrix composition, labeling methodology, etc. Often cell tracking methods require a sequence-specific segmentation method and manual tuning of many tracking parameters, which limits their applicability to sequences other than those they are designed for. In this paper, we propose 1) a deep learning based cell proposal method, which proposes candidates for cells along with their scores, and 2) a cell tracking method, which links proposals in adjacent frames in a graphical model using edges representing different cellular events and poses joint cell detection and tracking as the selection of a subset of cell and edge proposals. Our method is completely automated and given enough training data can be applied to a wide variety of microscopy sequences. We evaluate our method on multiple fluorescence and phase contrast microscopy sequences containing cells of various shapes and appearances from ISBI cell tracking challenge, and show that our method outperforms existing cell tracking methods.
Code is available at: https://github.com/SaadUllahAkram/CellTracker
△ Less
Submitted 9 May, 2017;
originally announced May 2017.
-
Image-based Localization using Hourglass Networks
Authors:
Iaroslav Melekhov,
Juha Ylioinas,
Juho Kannala,
Esa Rahtu
Abstract:
In this paper, we propose an encoder-decoder convolutional neural network (CNN) architecture for estimating camera pose (orientation and location) from a single RGB-image. The architecture has a hourglass shape consisting of a chain of convolution and up-convolution layers followed by a regression part. The up-convolution layers are introduced to preserve the fine-grained information of the input…
▽ More
In this paper, we propose an encoder-decoder convolutional neural network (CNN) architecture for estimating camera pose (orientation and location) from a single RGB-image. The architecture has a hourglass shape consisting of a chain of convolution and up-convolution layers followed by a regression part. The up-convolution layers are introduced to preserve the fine-grained information of the input image. Following the common practice, we train our model in end-to-end manner utilizing transfer learning from large scale classification data. The experiments demonstrate the performance of the approach on data exhibiting different lighting conditions, reflections, and motion blur. The results indicate a clear improvement over the previous state-of-the-art even when compared to methods that utilize sequence of test frames instead of a single frame.
△ Less
Submitted 24 August, 2017; v1 submitted 23 March, 2017;
originally announced March 2017.
-
Context Aware Query Image Representation for Particular Object Retrieval
Authors:
Zakaria Laskar,
Juho Kannala
Abstract:
The current models of image representation based on Convolutional Neural Networks (CNN) have shown tremendous performance in image retrieval. Such models are inspired by the information flow along the visual pathway in the human visual cortex. We propose that in the field of particular object retrieval, the process of extracting CNN representations from query images with a given region of interest…
▽ More
The current models of image representation based on Convolutional Neural Networks (CNN) have shown tremendous performance in image retrieval. Such models are inspired by the information flow along the visual pathway in the human visual cortex. We propose that in the field of particular object retrieval, the process of extracting CNN representations from query images with a given region of interest (ROI) can also be modelled by taking inspiration from human vision. Particularly, we show that by making the CNN pay attention on the ROI while extracting query image representation leads to significant improvement over the baseline methods on challenging Oxford5k and Paris6k datasets. Furthermore, we propose an extension to a recently introduced encoding method for CNN representations, regional maximum activations of convolutions (R-MAC). The proposed extension weights the regional representations using a novel saliency measure prior to aggregation. This leads to further improvement in retrieval accuracy.
△ Less
Submitted 3 March, 2017;
originally announced March 2017.
-
Inertial Odometry on Handheld Smartphones
Authors:
Arno Solin,
Santiago Cortes,
Esa Rahtu,
Juho Kannala
Abstract:
Building a complete inertial navigation system using the limited quality data provided by current smartphones has been regarded challenging, if not impossible. This paper shows that by careful crafting and accounting for the weak information in the sensor samples, smartphones are capable of pure inertial navigation. We present a probabilistic approach for orientation and use-case free inertial odo…
▽ More
Building a complete inertial navigation system using the limited quality data provided by current smartphones has been regarded challenging, if not impossible. This paper shows that by careful crafting and accounting for the weak information in the sensor samples, smartphones are capable of pure inertial navigation. We present a probabilistic approach for orientation and use-case free inertial odometry, which is based on double-integrating rotated accelerations. The strength of the model is in learning additive and multiplicative IMU biases online. We are able to track the phone position, velocity, and pose in real-time and in a computationally lightweight fashion by solving the inference with an extended Kalman filter. The information fusion is completed with zero-velocity updates (if the phone remains stationary), altitude correction from barometric pressure readings (if available), and pseudo-updates constraining the momentary speed. We demonstrate our approach using an iPad and iPhone in several indoor dead-reckoning applications and in a measurement tool setup.
△ Less
Submitted 7 June, 2018; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Relative Camera Pose Estimation Using Convolutional Neural Networks
Authors:
Iaroslav Melekhov,
Juha Ylioinas,
Juho Kannala,
Esa Rahtu
Abstract:
This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras. The proposed network takes RGB images from both cameras as input and directly produces the relative rotation and translation as output. The system is trained in an end-to-end manner utilising transfer learning from a large scale classification dataset. The introduced approach is…
▽ More
This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras. The proposed network takes RGB images from both cameras as input and directly produces the relative rotation and translation as output. The system is trained in an end-to-end manner utilising transfer learning from a large scale classification dataset. The introduced approach is compared with widely used local feature based methods (SURF, ORB) and the results indicate a clear improvement over the baseline. In addition, a variant of the proposed architecture containing a spatial pyramid pooling (SPP) layer is evaluated and shown to further improve the performance.
△ Less
Submitted 28 July, 2017; v1 submitted 5 February, 2017;
originally announced February 2017.
-
Inertial-Based Scale Estimation for Structure from Motion on Mobile Devices
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
Structure from motion algorithms have an inherent limitation that the reconstruction can only be determined up to the unknown scale factor. Modern mobile devices are equipped with an inertial measurement unit (IMU), which can be used for estimating the scale of the reconstruction. We propose a method that recovers the metric scale given inertial measurements and camera poses. In the process, we al…
▽ More
Structure from motion algorithms have an inherent limitation that the reconstruction can only be determined up to the unknown scale factor. Modern mobile devices are equipped with an inertial measurement unit (IMU), which can be used for estimating the scale of the reconstruction. We propose a method that recovers the metric scale given inertial measurements and camera poses. In the process, we also perform a temporal and spatial alignment of the camera and the IMU. Therefore, our solution can be easily combined with any existing visual reconstruction software. The method can cope with noisy camera pose estimates, typically caused by motion blur or rolling shutter artifacts, via utilizing a Rauch-Tung-Striebel (RTS) smoother. Furthermore, the scale estimation is performed in the frequency domain, which provides more robustness to inaccurate sensor time stamps and noisy IMU samples than the previously used time domain representation. In contrast to previous methods, our approach has no parameters that need to be tuned for achieving a good performance. In the experiments, we show that the algorithm outperforms the state-of-the-art in both accuracy and convergence speed of the scale estimate. The accuracy of the scale is around $1\%$ from the ground truth depending on the recording. We also demonstrate that our method can improve the scale accuracy of the Project Tango's build-in motion tracking.
△ Less
Submitted 11 August, 2017; v1 submitted 29 November, 2016;
originally announced November 2016.
-
Real-time Human Pose Estimation from Video with Convolutional Neural Networks
Authors:
Marko Linna,
Juho Kannala,
Esa Rahtu
Abstract:
In this paper, we present a method for real-time multi-person human pose estimation from video by utilizing convolutional neural networks. Our method is aimed for use case specific applications, where good accuracy is essential and variation of the background and poses is limited. This enables us to use a generic network architecture, which is both accurate and fast. We divide the problem into two…
▽ More
In this paper, we present a method for real-time multi-person human pose estimation from video by utilizing convolutional neural networks. Our method is aimed for use case specific applications, where good accuracy is essential and variation of the background and poses is limited. This enables us to use a generic network architecture, which is both accurate and fast. We divide the problem into two phases: (1) pre-training and (2) finetuning. In pre-training, the network is learned with highly diverse input data from publicly available datasets, while in finetuning we train with application specific data, which we record with Kinect. Our method differs from most of the state-of-the-art methods in that we consider the whole system, including person detector, pose estimator and an automatic way to record application specific training material for finetuning. Our method is considerably faster than many of the state-of-the-art methods. Our method can be thought of as a replacement for Kinect, and it can be used for higher level tasks, such as gesture control, games, person tracking, action recognition and action tracking. We achieved accuracy of 96.8\% ([email protected]) with application specific data.
△ Less
Submitted 23 September, 2016;
originally announced September 2016.
-
Fine-Grained Visual Classification of Aircraft
Authors:
Subhransu Maji,
Esa Rahtu,
Juho Kannala,
Matthew Blaschko,
Andrea Vedaldi
Abstract:
This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. A benchmark is obtained by defining corresponding classification tasks and evaluation protocols,…
▽ More
This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy. At the finer level, differences between models are often subtle but always visually measurable, making visual recognition challenging but possible. A benchmark is obtained by defining corresponding classification tasks and evaluation protocols, and baseline results are presented. The construction of this dataset was made possible by the work of aircraft enthusiasts, a strategy that can extend to the study of number of other object classes. Compared to the domains usually considered in fine-grained visual classification (FGVC), for example animals, aircraft are rigid and hence less deformable. They, however, present other interesting modes of variation, including purpose, size, designation, structure, historical style, and branding.
△ Less
Submitted 21 June, 2013;
originally announced June 2013.