Search | arXiv e-print repository

A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects

Authors: Margret Keuper, Siyu Tang, Yu Zhongjie, Bjoern Andres, Thomas Brox, Bernt Schiele

Abstract: Recently, Minimum Cost Multicut Formulations have been proposed and proven to be successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly,… ▽ More Recently, Minimum Cost Multicut Formulations have been proposed and proven to be successful in both motion trajectory segmentation and multi-target tracking scenarios. Both tasks benefit from decomposing a graphical model into an optimal number of connected components based on attractive and repulsive pairwise terms. The two tasks are formulated on different levels of granularity and, accordingly, leverage mostly local information for motion segmentation and mostly high-level information for multi-target tracking. In this paper we argue that point trajectories and their local relationships can contribute to the high-level task of multi-target tracking and also argue that high-level cues from object detection and tracking are helpful to solve motion segmentation. We propose a joint graphical model for point trajectories and object detections whose Multicuts are solutions to motion segmentation {\it and} multi-target tracking problems at once. Results on the FBMS59 motion segmentation benchmark as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark demonstrate the promise of this joint approach. △ Less

Submitted 21 July, 2016; originally announced July 2016.

arXiv:1606.06650 [pdf, other]

3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation

Authors: Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, Olaf Ronneberger

Abstract: This paper introduces a network for volumetric segmentation that learns from sparsely annotated volumetric images. We outline two attractive use cases of this method: (1) In a semi-automated setup, the user annotates some slices in the volume to be segmented. The network learns from these sparse annotations and provides a dense 3D segmentation. (2) In a fully-automated setup, we assume that a repr… ▽ More This paper introduces a network for volumetric segmentation that learns from sparsely annotated volumetric images. We outline two attractive use cases of this method: (1) In a semi-automated setup, the user annotates some slices in the volume to be segmented. The network learns from these sparse annotations and provides a dense 3D segmentation. (2) In a fully-automated setup, we assume that a representative, sparsely annotated training set exists. Trained on this data set, the network densely segments new volumetric images. The proposed network extends the previous u-net architecture from Ronneberger et al. by replacing all 2D operations with their 3D counterparts. The implementation performs on-the-fly elastic deformations for efficient data augmentation during training. It is trained end-to-end from scratch, i.e., no pre-trained network is required. We test the performance of the proposed method on a complex, highly variable 3D structure, the Xenopus kidney, and achieve good results for both use cases. △ Less

Submitted 21 June, 2016; originally announced June 2016.

Comments: Conditionally accepted for MICCAI 2016

arXiv:1606.02467 [pdf, other]

Point-wise mutual information-based video segmentation with high temporal consistency

Authors: Margret Keuper, Thomas Brox

Abstract: In this paper, we tackle the problem of temporally consistent boundary detection and hierarchical segmentation in videos. While finding the best high-level reasoning of region assignments in videos is the focus of much recent research, temporal consistency in boundary detection has so far only rarely been tackled. We argue that temporally consistent boundaries are a key component to temporally con… ▽ More In this paper, we tackle the problem of temporally consistent boundary detection and hierarchical segmentation in videos. While finding the best high-level reasoning of region assignments in videos is the focus of much recent research, temporal consistency in boundary detection has so far only rarely been tackled. We argue that temporally consistent boundaries are a key component to temporally consistent region assignment. The proposed method is based on the point-wise mutual information (PMI) of spatio-temporal voxels. Temporal consistency is established by an evaluation of PMI-based point affinities in the spectral domain over space and time. Thus, the proposed method is independent of any optical flow computation or previously learned motion models. The proposed low-level video segmentation method outperforms the learning-based state of the art in terms of standard region metrics. △ Less

Submitted 8 June, 2016; originally announced June 2016.

arXiv:1605.09304 [pdf, other]

Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

Authors: Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune

Abstract: Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understan… ▽ More Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images). △ Less

Submitted 23 November, 2016; v1 submitted 30 May, 2016; originally announced May 2016.

Comments: 29 pages, 35 figures, NIPS camera-ready

arXiv:1604.08610 [pdf, other]

doi 10.1007/978-3-319-45886-1_3

Artistic style transfer for videos

Authors: Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

Abstract: In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer i… ▽ More In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer in still images and propose new initializations and loss functions applicable to videos. This allows us to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusion. We show that the proposed method clearly outperforms simpler baselines both qualitatively and quantitatively. △ Less

Submitted 19 October, 2016; v1 submitted 28 April, 2016; originally announced April 2016.

Comments: final version appeared in GCPR-2016; minor changes to improve the clarity

Journal ref: German Conference on Pattern Recognition (GCPR), LNCS 9796, pp. 26-36 (2016)

arXiv:1604.05096 [pdf, other]

Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling

Authors: Jonas Uhrig, Marius Cordts, Uwe Franke, Thomas Brox

Abstract: Recent approaches for instance-aware semantic labeling have augmented convolutional neural networks (CNNs) with complex multi-task architectures or computationally expensive graphical models. We present a method that leverages a fully convolutional network (FCN) to predict semantic labels, depth and an instance-based encoding using each pixel's direction towards its corresponding instance center.… ▽ More Recent approaches for instance-aware semantic labeling have augmented convolutional neural networks (CNNs) with complex multi-task architectures or computationally expensive graphical models. We present a method that leverages a fully convolutional network (FCN) to predict semantic labels, depth and an instance-based encoding using each pixel's direction towards its corresponding instance center. Subsequently, we apply low-level computer vision techniques to generate state-of-the-art instance segmentation on the street scene datasets KITTI and Cityscapes. Our approach outperforms existing works by a large margin and can additionally predict absolute distances of individual instances from a monocular image as well as a pixel-level semantic labeling. △ Less

Submitted 14 July, 2016; v1 submitted 18 April, 2016; originally announced April 2016.

Comments: Accepted at GCPR 2016. Includes supplementary material

arXiv:1604.03351 [pdf, other]

Orientation-boosted Voxel Nets for 3D Object Recognition

Authors: Nima Sedaghat, Mohammadreza Zolfaghari, Ehsan Amiri, Thomas Brox

Abstract: Recent work has shown good recognition results in 3D object recognition using 3D convolutional networks. In this paper, we show that the object orientation plays an important role in 3D recognition. More specifically, we argue that objects induce different features in the network under rotation. Thus, we approach the category-level classification task as a multi-task problem, in which the network… ▽ More Recent work has shown good recognition results in 3D object recognition using 3D convolutional networks. In this paper, we show that the object orientation plays an important role in 3D recognition. More specifically, we argue that objects induce different features in the network under rotation. Thus, we approach the category-level classification task as a multi-task problem, in which the network is trained to predict the pose of the object in addition to the class label as a parallel task. We show that this yields significant improvements in the classification results. We test our suggested architecture on several datasets representing various 3D data sources: LiDAR data, CAD models, and RGB-D images. We report state-of-the-art results on classification as well as significant improvements in precision and speed over the baseline on 3D detection. △ Less

Submitted 19 October, 2017; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: BMVC'17 version. Added some experiments + auto-alignment of Modelnet40

arXiv:1602.07080 [pdf, other]

Techniques for Gradient Based Bilevel Optimization with Nonsmooth Lower Level Problems

Authors: Peter Ochs, René Ranftl, Thomas Brox, Thomas Pock

Abstract: We propose techniques for approximating bilevel optimization problems with non-smooth lower level problems that can have a non-unique solution. To this end, we substitute the expression of a minimizer of the lower level minimization problem with an iterative algorithm that is guaranteed to converge to a minimizer of the problem. Using suitable non-linear proximal distance functions, the update map… ▽ More We propose techniques for approximating bilevel optimization problems with non-smooth lower level problems that can have a non-unique solution. To this end, we substitute the expression of a minimizer of the lower level minimization problem with an iterative algorithm that is guaranteed to converge to a minimizer of the problem. Using suitable non-linear proximal distance functions, the update map**s of such an iterative algorithm can be differentiable, notwithstanding the fact that the minimization problem is non-smooth. △ Less

Submitted 26 April, 2016; v1 submitted 23 February, 2016; originally announced February 2016.

arXiv:1602.02644 [pdf, other]

Generating Images with Perceptual Similarity Metrics based on Deep Networks

Authors: Alexey Dosovitskiy, Thomas Brox

Abstract: Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by de… ▽ More Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by deep neural networks. This metric better reflects perceptually similarity of images and thus leads to better results. We show three applications: autoencoder training, a modification of a variational autoencoder, and inversion of deep convolutional networks. In all cases, the generated images look sharp and resemble natural images. △ Less

Submitted 9 February, 2016; v1 submitted 8 February, 2016; originally announced February 2016.

Comments: minor corrections

arXiv:1512.02134 [pdf, other]

doi 10.1109/CVPR.2016.438

A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

Authors: Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox

Abstract: Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we pro… ▽ More Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network. △ Less

Submitted 7 December, 2015; originally announced December 2015.

Comments: Includes supplementary material

ACM Class: I.2.6; I.2.10; I.4.8

arXiv:1511.06702 [pdf, other]

Multi-view 3D Models from Single Images with a Convolutional Network

Authors: Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

Abstract: We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh.… ▽ More We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh. The network is trained on renderings of synthetic 3D models of cars and chairs. It successfully deals with objects on cluttered background and generates reasonable predictions for real images of cars. △ Less

Submitted 2 August, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

arXiv:1506.02753 [pdf, other]

Inverting Visual Representations with Convolutional Networks

Authors: Alexey Dosovitskiy, Thomas Brox

Abstract: Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach… ▽ More Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features. Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network. Most strikingly, the colors and the rough contours of an image can be reconstructed from activations in higher network layers and even from the predicted class probabilities. △ Less

Submitted 26 April, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

Comments: Version 4 - final version to appear in CVPR-2016. Visually better results obtained with feature similarity and adversarial training are in a different paper - arXiv:1602.02644

arXiv:1505.06973 [pdf, other]

Efficient Decomposition of Image and Mesh Graphs by Lifted Multicuts

Authors: Margret Keuper, Evgeny Levinkov, Nicolas Bonneel, Guillaume Lavoué, Thomas Brox, Bjoern Andres

Abstract: Formulations of the Image Decomposition Problem as a Multicut Problem (MP) w.r.t. a superpixel graph have received considerable attention. In contrast, instances of the MP w.r.t. a pixel grid graph have received little attention, firstly, because the MP is NP-hard and instances w.r.t. a pixel grid graph are hard to solve in practice, and, secondly, due to the lack of long-range terms in the object… ▽ More Formulations of the Image Decomposition Problem as a Multicut Problem (MP) w.r.t. a superpixel graph have received considerable attention. In contrast, instances of the MP w.r.t. a pixel grid graph have received little attention, firstly, because the MP is NP-hard and instances w.r.t. a pixel grid graph are hard to solve in practice, and, secondly, due to the lack of long-range terms in the objective function of the MP. We propose a generalization of the MP with long-range terms (LMP). We design and implement two efficient algorithms (primal feasible heuristics) for the MP and LMP which allow us to study instances of both problems w.r.t. the pixel grid graphs of the images in the BSDS-500 benchmark. The decompositions we obtain do not differ significantly from the state of the art, suggesting that the LMP is a competitive formulation of the Image Decomposition Problem. To demonstrate the generality of the LMP, we apply it also to the Mesh Decomposition Problem posed by the Princeton benchmark, obtaining state-of-the-art decompositions. △ Less

Submitted 11 September, 2015; v1 submitted 26 May, 2015; originally announced May 2015.

arXiv:1505.04597 [pdf, other]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Authors: Olaf Ronneberger, Philipp Fischer, Thomas Brox

Abstract: There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise… ▽ More There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net . △ Less

Submitted 18 May, 2015; originally announced May 2015.

Comments: conditionally accepted at MICCAI 2015

arXiv:1504.06852 [pdf, other]

FlowNet: Learning Optical Flow with Convolutional Networks

Authors: Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

Abstract: Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare tw… ▽ More Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth data sets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps. △ Less

Submitted 4 May, 2015; v1 submitted 26 April, 2015; originally announced April 2015.

Comments: Added supplementary material

ACM Class: I.2.6; I.4.8

arXiv:1412.6806 [pdf, other]

Striving for Simplicity: The All Convolutional Net

Authors: Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller

Abstract: Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that… ▽ More Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches. △ Less

Submitted 13 April, 2015; v1 submitted 21 December, 2014; originally announced December 2014.

Comments: accepted to ICLR-2015 workshop track; no changes other than style

arXiv:1411.5928 [pdf, other]

Learning to Generate Chairs, Tables and Cars with Convolutional Networks

Authors: Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox

Abstract: We train generative 'up-convolutional' neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of differ… ▽ More We train generative 'up-convolutional' neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of different models, interpolate between given views to generate the missing ones, extrapolate views, and invent new objects not present in the training set by recombining training instances, or even two different object classes. Moreover, we show that such generative networks can be used to find correspondences between different objects from the dataset, outperforming existing approaches on this task. △ Less

Submitted 2 August, 2017; v1 submitted 21 November, 2014; originally announced November 2014.

Comments: v4: final PAMI version. New architecture figure

arXiv:1406.6909 [pdf, other]

Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks

Authors: Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, Thomas Brox

Abstract: Deep convolutional networks have proven to be very successful in learning task specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges, when approaching a… ▽ More Deep convolutional networks have proven to be very successful in learning task specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges, when approaching a new task. In this paper, we aim for generic feature learning and present an approach for training a convolutional network using only unlabeled data. To this end, we train the network to discriminate between a set of surrogate classes. Each surrogate class is formed by applying a variety of transformations to a randomly sampled 'seed' image patch. In contrast to supervised network training, the resulting feature representation is not class specific. It rather provides robustness to the transformations that have been applied during training. This generic feature representation allows for classification results that outperform the state of the art for unsupervised learning on several popular datasets (STL-10, CIFAR-10, Caltech-101, Caltech-256). While such generic features cannot compete with class specific features from supervised training on a classification task, we show that they are advantageous on geometric matching problems, where they also outperform the SIFT descriptor. △ Less

Submitted 19 June, 2015; v1 submitted 26 June, 2014; originally announced June 2014.

Comments: PAMI submission. Includes matching experiments as in arXiv:1405.5769v1. Also includes new network architectures, experiments on Caltech-256, experiment on combining Exemplar-CNN with clustering

arXiv:1405.5769

Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT

Authors: Philipp Fischer, Alexey Dosovitskiy, Thomas Brox

Abstract: Latest results indicate that features learned via convolutional neural networks outperform previous descriptors on classification tasks by a large margin. It has been shown that these networks still work well when they are applied to datasets or recognition tasks different from those they were trained on. However, descriptors like SIFT are not only used in recognition but also for many corresponde… ▽ More Latest results indicate that features learned via convolutional neural networks outperform previous descriptors on classification tasks by a large margin. It has been shown that these networks still work well when they are applied to datasets or recognition tasks different from those they were trained on. However, descriptors like SIFT are not only used in recognition but also for many correspondence problems that rely on descriptor matching. In this paper we compare features from various layers of convolutional neural nets to standard SIFT descriptors. We consider a network that was trained on ImageNet and another one that was trained without supervision. Surprisingly, convolutional neural networks clearly outperform SIFT on descriptor matching. This paper has been merged with arXiv:1406.6909 △ Less

Submitted 24 June, 2015; v1 submitted 22 May, 2014; originally announced May 2014.

Comments: This paper has been merged with arXiv:1406.6909

ACM Class: I.2.6; I.4.7; I.4.8

arXiv:1404.4805 [pdf, other]

iPiano: Inertial Proximal Algorithm for Non-Convex Optimization

Authors: Peter Ochs, Yun** Chen, Thomas Brox, Thomas Pock

Abstract: In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly non-convex) and a convex (possibly non-differentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a non-smooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class o… ▽ More In this paper we study an algorithm for solving a minimization problem composed of a differentiable (possibly non-convex) and a convex (possibly non-differentiable) function. The algorithm iPiano combines forward-backward splitting with an inertial force. It can be seen as a non-smooth split version of the Heavy-ball method from Polyak. A rigorous analysis of the algorithm for the proposed class of problems yields global convergence of the function values and the arguments. This makes the algorithm robust for usage on non-convex problems. The convergence result is obtained based on the \KL inequality. This is a very weak restriction, which was used to prove convergence for several other gradient methods. First, an abstract convergence theorem for a generic algorithm is proved, and, then iPiano is shown to satisfy the requirements of this theorem. Furthermore, a convergence rate is established for the general problem class. We demonstrate iPiano on computer vision problems: image denoising with learned priors and diffusion based image compression. △ Less

Submitted 18 April, 2014; originally announced April 2014.

Comments: 32pages, 7 figures, to appear in SIAM Journal on Imaging Sciences

arXiv:1312.5242 [pdf, other]

Unsupervised feature learning by augmenting single images

Authors: Alexey Dosovitskiy, Jost Tobias Springenberg, Thomas Brox

Abstract: When deep learning is applied to visual object recognition, data augmentation is often used to generate additional training data without extra labeling cost. It helps to reduce overfitting and increase the performance of the algorithm. In this paper we investigate if it is possible to use data augmentation as the main component of an unsupervised feature learning architecture. To that end we sampl… ▽ More When deep learning is applied to visual object recognition, data augmentation is often used to generate additional training data without extra labeling cost. It helps to reduce overfitting and increase the performance of the algorithm. In this paper we investigate if it is possible to use data augmentation as the main component of an unsupervised feature learning architecture. To that end we sample a set of random image patches and declare each of them to be a separate single-image surrogate class. We then extend these trivial one-element classes by applying a variety of transformations to the initial 'seed' patches. Finally we train a convolutional neural network to discriminate between these surrogate classes. The feature representation learned by the network can then be used in various vision tasks. We find that this simple feature learning algorithm is surprisingly successful, achieving competitive classification results on several popular vision datasets (STL-10, CIFAR-10, Caltech-101). △ Less

Submitted 16 February, 2014; v1 submitted 18 December, 2013; originally announced December 2013.

Comments: ICLR 2014 workshop track submission (7 pages, 4 figures, 1 table)

Showing 101–121 of 121 results for author: Brox, T