Showing 1–48 of 48 results for author: Dosovitskiy, A

  1. arXiv:2406.04312

    cs.CV

    ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

    Authors: Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, Zeynep Akata

    Abstract: Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optim…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Preprint

  2. arXiv:2205.06230

    cs.CV

    Simple Open-Vocabulary Object Detection with Vision Transformers

    Authors: Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

    Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary…

    Submitted 20 July, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: ECCV 2022 camera-ready version

  3. arXiv:2111.13152

    cs.CV cs.AI cs.GR cs.LG cs.RO

    Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

    Authors: Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, Andrea Tagliasacchi

    Abstract: A classical problem in computer vision is to infer a 3D scene representation from few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel sc…

    Submitted 29 March, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

    Comments: Accepted to CVPR 2022, Project website: https://srt-paper.github.io/

    Journal ref: CVPR 2022

  4. arXiv:2111.12594

    cs.CV cs.LG stat.ML

    Conditional Object-Centric Learning from Video

    Authors: Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff

    Abstract: Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built. Recent work on simple 2D and 3D datasets has shown that models with object-centric inductive biases can learn to segment and represent meaningful objects from the statistical structure of the data alone without the need for…

    Submitted 15 March, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Published at ICLR 2022. Project page at https://slot-attention-video.github.io/

  5. arXiv:2108.08810

    cs.CV cs.AI cs.LG stat.ML

    Do Vision Transformers See Like Convolutional Neural Networks?

    Authors: Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy

    Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual re…

    Submitted 3 March, 2022; v1 submitted 19 August, 2021; originally announced August 2021.

  6. arXiv:2105.01601

    cs.CV cs.AI cs.LG

    MLP-Mixer: An all-MLP Architecture for Vision

    Authors: Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy

    Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-…

    Submitted 11 June, 2021; v1 submitted 4 May, 2021; originally announced May 2021.

    Comments: v2: Fixed parameter counts in Table 1. v3: Added results on JFT-3B in Figure 2(right); Added Section 3.4 on the input permutations. v4: Updated the x label in Figure 2(right)

  7. arXiv:2104.03059

    cs.CV cs.AI cs.LG stat.ML

    Differentiable Patch Selection for Image Recognition

    Authors: Jean-Baptiste Cordonnier, Aravindh Mahendran, Alexey Dosovitskiy, Dirk Weissenborn, Jakob Uszkoreit, Thomas Unterthiner

    Abstract: Neural Networks require large amounts of memory and compute to process high resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high resolution images. Our method may be interfaced with any downstream neural network…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021. Code available at https://github.com/google-research/google-research/tree/master/ptopk_patch_selection/

  8. arXiv:2011.10287

    cs.CV cs.LG

    Learning Object-Centric Video Models by Contrasting Sets

    Authors: Sindy Löwe, Klaus Greff, Rico Jonschkowski, Alexey Dosovitskiy, Thomas Kipf

    Abstract: Contrastive, self-supervised learning of object representations recently emerged as an attractive alternative to reconstruction-based training. Prior approaches focus on contrasting individual object representations (slots) against one another. However, a fundamental problem with this approach is that the overall contrastive loss is the same for (i) representing a different object in each slot, as…

    Submitted 20 November, 2020; originally announced November 2020.

    Comments: NeurIPS 2020 Workshop on Object Representations for Learning and Reasoning

  9. arXiv:2010.11929

    cs.CV cs.AI cs.LG

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

    Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not nece…

    Submitted 3 June, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer. ICLR camera-ready version with 2 small modifications: 1) Added a discussion of CLS vs GAP classifier in the appendix, 2) Fixed an error in exaFLOPs computation in Figure 5 and Table 6 (relative performance of models is basically not affected)

  10. arXiv:2008.02268

    cs.CV cs.GR cs.LG

    NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections

    Authors: Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, Daniel Duckworth

    Abstract: We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multilayer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings,…

    Submitted 6 January, 2021; v1 submitted 5 August, 2020; originally announced August 2020.

    Comments: Project website: https://nerf-w.github.io. Ricardo Martin-Brualla, Noha Radwan, and Mehdi S. M. Sajjadi contributed equally to this work. Updated with results for three additional scenes

  11. arXiv:2006.15055

    cs.LG cs.CV stat.ML

    Object-Centric Learning with Slot Attention

    Authors: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf

    Abstract: Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with pe…

    Submitted 14 October, 2020; v1 submitted 26 June, 2020; originally announced June 2020.

    Comments: NeurIPS 2020. Code available at https://github.com/google-research/google-research/tree/master/slot_attention

  12. Learning Depth With Very Sparse Supervision

    Authors: Antonio Loquercio, Alexey Dosovitskiy, Davide Scaramuzza

    Abstract: Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing works for depth estimation require either massive amounts of annotated training data or some form of hard-coded geometrical constraint. This paper expl…

    Submitted 16 July, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

    Comments: Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2020, and International Conference on Intelligent Robots and Systems (IROS) 2020

    Journal ref: IEEE Robotics and Automation Letters (RA-L) 2020

  13. arXiv:1910.04867

    cs.CV cs.LG stat.ML

    A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

    Authors: Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, Neil Houlsby

    Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, r…

    Submitted 21 February, 2020; v1 submitted 1 October, 2019; originally announced October 2019.

  14. Deep Drone Racing: From Simulation to Reality with Domain Randomization

    Authors: Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

    Abstract: Dynamically changing environments, unreliable state estimation, and operation under severe resource constraints are fundamental challenges that limit the deployment of small autonomous drones. We address these challenges in the context of autonomous, vision-based drone racing in dynamic environments. A racing drone must traverse a track with possibly moving gates at high speed. We enable this func…

    Submitted 25 November, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

    Comments: Accepted as a Regular Paper to the IEEE Transactions on Robotics Journal. arXiv admin note: substantial text overlap with arXiv:1806.08548

    Journal ref: IEEE Transactions on Robotics 2019

  15. arXiv:1901.10915

    cs.CV cs.AI cs.LG cs.RO

    Benchmarking Classic and Learned Navigation in Complex 3D Environments

    Authors: Dmytro Mishkin, Alexey Dosovitskiy, Vladlen Koltun

    Abstract: Navigation research is attracting renewed interest with the advent of learning-based methods. However, this new line of work is largely disconnected from well-established classic navigation approaches. In this paper, we take a step towards coordinating these two directions of research. We set up classic and learning-based navigation systems in common simulated environments and thoroughly evaluate…

    Submitted 28 March, 2019; v1 submitted 30 January, 2019; originally announced January 2019.

    Comments: Added CNN-Monodepth and OpenCV Stereo agents

  16. arXiv:1901.08652

    cs.RO cs.LG stat.ML

    Learning agile and dynamic motor skills for legged robots

    Authors: Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, Marco Hutter

    Abstract: Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots is mainly limited t…

    Submitted 24 January, 2019; originally announced January 2019.

    Journal ref: Science Robotics 4.26 (2019): eaau5872

  17. arXiv:1901.03162

    cs.LG cs.AI cs.CV stat.ML

    Motion Perception in Reinforcement Learning with Dynamic Objects

    Authors: Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, Thomas Brox

    Abstract: In dynamic environments, learned controllers are supposed to take motion into account when selecting the action to be taken. However, in existing reinforcement learning works motion is rarely treated explicitly; it is rather assumed that the controller learns the necessary motion representation from temporal stacks of frames implicitly. In this paper, we show that for continuous control tasks lear…

    Submitted 1 February, 2019; v1 submitted 10 January, 2019; originally announced January 2019.

  18. arXiv:1810.09381

    cs.CV cs.LG

    Unsupervised Learning of Shape and Pose with Differentiable Point Clouds

    Authors: Eldar Insafutdinov, Alexey Dosovitskiy

    Abstract: We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal…

    Submitted 22 October, 2018; originally announced October 2018.

  19. arXiv:1810.06224

    cs.RO

    Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing

    Authors: Elia Kaufmann, Mathias Gehrig, Philipp Foehn, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

    Abstract: Autonomous micro aerial vehicles still struggle with fast and agile maneuvers, dynamic environments, imperfect sensing, and state estimation drift. Autonomous drone racing brings these challenges to the fore. Human pilots can fly a previously unseen track after a handful of practice runs. In contrast, state-of-the-art autonomous navigation algorithms require either a precise metric map of the envi…

    Submitted 1 March, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

    Comments: 6 pages (+1 references)

    Journal ref: IEEE International Conference on Robotics and Automation (ICRA), 2019

  20. arXiv:1809.04843

    cs.CV

    On Offline Evaluation of Vision-based Driving Models

    Authors: Felipe Codevilla, Antonio M. López, Vladlen Koltun, Alexey Dosovitskiy

    Abstract: Autonomous driving models should ideally be evaluated by deploying them on a fleet of physical vehicles in the real world. Unfortunately, this approach is not practical for the vast majority of researchers. An attractive alternative is to evaluate models offline, on a pre-collected validation dataset with ground truth annotation. In this paper, we investigate the relation between various online an…

    Submitted 13 September, 2018; originally announced September 2018.

    Comments: Published at the ECCV 2018 conference

  21. Frequency-Aware Model Predictive Control

    Authors: Ruben Grandia, Farbod Farshidian, Alexey Dosovitskiy, René Ranftl, Marco Hutter

    Abstract: Transferring solutions found by trajectory optimization to robotic hardware remains a challenging task. When the optimization fully exploits the provided model to perform dynamic tasks, the presence of unmodeled dynamics renders the motion infeasible on the real system. Model errors can be a result of model simplifications, but also naturally arise when deploying the robot in unstructured and nond…

    Submitted 8 February, 2019; v1 submitted 12 September, 2018; originally announced September 2018.

    Journal ref: IEEE Robotics and Automation Letters 2019

  22. arXiv:1807.06757

    cs.AI cs.CV cs.LG cs.RO

    On Evaluation of Embodied Navigation Agents

    Authors: Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir

    Abstract: Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study emp…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: Report of a working group on empirical methodology in navigation research. Authors are listed in alphabetical order

  23. arXiv:1806.08548

    cs.RO

    Deep Drone Racing: Learning Agile Flight in Dynamic Environments

    Authors: Elia Kaufmann, Antonio Loquercio, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

    Abstract: Autonomous agile flight brings up fundamental challenges in robotics, such as coping with unreliable state estimation, reacting optimally to dynamically changing environments, and coupling perception and action in real time under severe resource constraints. In this paper, we consider these challenges in the context of autonomous, vision-based drone racing in dynamic environments. Our approach com…

    Submitted 9 October, 2018; v1 submitted 22 June, 2018; originally announced June 2018.

    Comments: Accepted for publication in the Conference on Robotic Learning (CoRL) 2018, Zurich. 10 pages (+3 supplementary)

    Journal ref: Conference on Robotic Learning (CoRL), 2018

  24. arXiv:1806.01175

    cs.LG cs.AI stat.ML

    TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning

    Authors: Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, Thomas Brox

    Abstract: Our understanding of reinforcement learning (RL) has been shaped by theoretical and empirical results that were obtained decades ago using tabular representations and linear function approximators. These results suggest that RL methods that use temporal differencing (TD) are superior to direct Monte Carlo estimation (MC). How do these results hold up in deep RL, which deals with perceptually compl…

    Submitted 4 June, 2018; originally announced June 2018.

  25. arXiv:1804.09364

    cs.RO cs.CV cs.LG

    Driving Policy Transfer via Modularity and Abstraction

    Authors: Matthias Müller, Alexey Dosovitskiy, Bernard Ghanem, Vladlen Koltun

    Abstract: End-to-end approaches to autonomous driving have high sample complexity and are difficult to scale to realistic urban driving. Simulation can help end-to-end driving systems by providing a cheap, safe, and diverse training environment. Yet training driving policies in simulation brings up the problem of transferring such policies to the real world. We present an approach to transferring driving po…

    Submitted 13 December, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

    Comments: Accepted at Conference on Robotic Learning (CoRL'18) http://proceedings.mlr.press/v87/mueller18a.html

  26. arXiv:1803.00653

    cs.LG cs.AI cs.CV cs.RO

    Semi-parametric Topological Memory for Navigation

    Authors: Nikolay Savinov, Alexey Dosovitskiy, Vladlen Koltun

    Abstract: We introduce a new memory architecture for navigation in previously unseen environments, inspired by landmark-based navigation in animals. The proposed semi-parametric topological memory (SPTM) consists of a (non-parametric) graph with nodes corresponding to locations in the environment and a (parametric) deep network capable of retrieving nodes from the graph based on observations. The graph stor…

    Submitted 1 March, 2018; originally announced March 2018.

    Comments: Published at International Conference on Learning Representations (ICLR) 2018. Project website at https://sites.google.com/view/SPTM

  27. What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?

    Authors: Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox

    Abstract: The finding that very large networks can be trained efficiently and reliably has led to a paradigm shift in computer vision from engineered solutions to learning formulations. As a result, the research challenge shifts from devising algorithms to creating suitable and abundant training data for supervised learning. How to efficiently create such training data? The dominant data acquisition method…

    Submitted 22 March, 2018; v1 submitted 19 January, 2018; originally announced January 2018.

    Comments: added references (UCL dataset); added IJCV copyright information

  28. arXiv:1712.03931

    cs.LG cs.AI cs.CV cs.GR cs.RO

    MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments

    Authors: Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, Vladlen Koltun

    Abstract: We present MINOS, a simulator designed to support the development of multisensory models for goal-directed navigation in complex indoor environments. The simulator leverages large datasets of complex 3D environments and supports flexible configuration of multimodal sensor suites. We use MINOS to benchmark deep-learning-based navigation methods, to analyze the influence of environmental complexity…

    Submitted 11 December, 2017; originally announced December 2017.

    Comments: MINOS is a simulator designed to support research on end-to-end navigation

  29. arXiv:1711.03938

    cs.LG cs.AI cs.CV cs.RO

    CARLA: An Open Urban Driving Simulator

    Authors: Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, Vladlen Koltun

    Abstract: We introduce CARLA, an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. The simulation…

    Submitted 10 November, 2017; originally announced November 2017.

    Comments: Published at the 1st Conference on Robot Learning (CoRL)

  30. arXiv:1710.02410

    cs.RO cs.CV cs.LG

    End-to-end Driving via Conditional Imitation Learning

    Authors: Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, Alexey Dosovitskiy

    Abstract: Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learni…

    Submitted 2 March, 2018; v1 submitted 6 October, 2017; originally announced October 2017.

    Comments: Published at the International Conference on Robotics and Automation (ICRA), 2018

  31. Artistic style transfer for videos and spherical images

    Authors: Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

    Abstract: Manually re-drawing an image in a certain artistic style takes a professional artist a long time. Doing this for a video sequence single-handedly is beyond imagination. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt to videos the original image style transfer technique by Gatys et a…

    Submitted 5 August, 2018; v1 submitted 13 August, 2017; originally announced August 2017.

    Comments: v3: added ref to conference. This paper is a successor of and overlaps with arXiv:1604.08610, International Journal of Computer Vision (IJCV), 2018

  32. arXiv:1703.09438

    cs.CV

    Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs

    Authors: Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

    Abstract: We present a deep convolutional decoder architecture that can generate volumetric 3D outputs in a compute- and memory-efficient manner by using an octree representation. The network learns to predict both the structure of the octree, and the occupancy values of individual cells. This makes it a particularly valuable technique for generating 3D shapes. In contrast to standard decoders acting on reg…

    Submitted 7 August, 2017; v1 submitted 28 March, 2017; originally announced March 2017.

  33. DeMoN: Depth and Motion Network for Learning Monocular Stereo

    Authors: Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, Thomas Brox

    Abstract: In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and mot…

    Submitted 11 April, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

    Comments: Camera ready version for CVPR 2017. Supplementary material included. Project page: http://lmb.informatik.uni-freiburg.de/people/ummenhof/depthmotionnet/

  34. arXiv:1612.01925

    cs.CV

    FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

    Authors: Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, Thomas Brox

    Abstract: The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it…

    Submitted 6 December, 2016; originally announced December 2016.

    Comments: Including supplementary material. For the video see: http://lmb.informatik.uni-freiburg.de/Publications/2016/IMKDB16/

  35. arXiv:1612.00005

    cs.CV

    Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

    Authors: Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, Jason Yosinski

    Abstract: Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing…

    Submitted 12 April, 2017; v1 submitted 30 November, 2016; originally announced December 2016.

    Comments: CVPR camera-ready

  36. arXiv:1611.01779

    cs.LG cs.AI cs.CV

    Learning to Act by Predicting the Future

    Authors: Alexey Dosovitskiy, Vladlen Koltun

    Abstract: We present an approach to sensorimotor control in immersive environments. Our approach utilizes a high-dimensional sensory stream and a lower-dimensional measurement stream. The cotemporal structure of these streams provides a rich supervisory signal, which enables training a sensorimotor control model by interacting with the environment. The model is trained using supervised learning techniques,…

    Submitted 14 February, 2017; v1 submitted 6 November, 2016; originally announced November 2016.

    Comments: Published as a conference paper at ICLR 2017

  37. arXiv:1605.09304

    cs.NE cs.AI cs.CV cs.LG

    Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

    Authors: Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, Jeff Clune

    Abstract: Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understan…

    Submitted 23 November, 2016; v1 submitted 30 May, 2016; originally announced May 2016.

    Comments: 29 pages, 35 figures, NIPS camera-ready

  38. Artistic style transfer for videos

    Authors: Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

    Abstract: In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer i…

    Submitted 19 October, 2016; v1 submitted 28 April, 2016; originally announced April 2016.

    Comments: final version appeared in GCPR-2016; minor changes to improve the clarity

    Journal ref: German Conference on Pattern Recognition (GCPR), LNCS 9796, pp. 26-36 (2016)

  39. arXiv:1602.02644  [pdf, other

    cs.LG cs.CV cs.NE

    Generating Images with Perceptual Similarity Metrics based on Deep Networks

    Authors: Alexey Dosovitskiy, Thomas Brox

    Abstract: Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by de…

    Submitted 9 February, 2016; v1 submitted 8 February, 2016; originally announced February 2016.

    Comments: minor corrections

  40. arXiv:1512.02134  [pdf, other

    cs.CV cs.LG stat.ML

    A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

    Authors: Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox

    Abstract: Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we pro…

    Submitted 7 December, 2015; originally announced December 2015.

    Comments: Includes supplementary material

    ACM Class: I.2.6; I.2.10; I.4.8

  41. arXiv:1511.06702  [pdf, other

    cs.CV

    Multi-view 3D Models from Single Images with a Convolutional Network

    Authors: Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

    Abstract: We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh.…

    Submitted 2 August, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

  42. arXiv:1506.02753  [pdf, other

    cs.NE cs.CV cs.LG

    Inverting Visual Representations with Convolutional Networks

    Authors: Alexey Dosovitskiy, Thomas Brox

    Abstract: Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach…

    Submitted 26 April, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

    Comments: Version 4 - final version to appear in CVPR-2016. Visually better results obtained with feature similarity and adversarial training are in a different paper - arXiv:1602.02644

  43. arXiv:1504.06852  [pdf, other

    cs.CV cs.LG

    FlowNet: Learning Optical Flow with Convolutional Networks

    Authors: Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

    Abstract: Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare tw…

    Submitted 4 May, 2015; v1 submitted 26 April, 2015; originally announced April 2015.

    Comments: Added supplementary material

    ACM Class: I.2.6; I.4.8

  44. arXiv:1412.6806  [pdf, other

    cs.LG cs.CV cs.NE

    Striving for Simplicity: The All Convolutional Net

    Authors: Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin Riedmiller

    Abstract: Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that…

    Submitted 13 April, 2015; v1 submitted 21 December, 2014; originally announced December 2014.

    Comments: accepted to ICLR-2015 workshop track; no changes other than style

  45. arXiv:1411.5928  [pdf, other

    cs.CV cs.LG cs.NE

    Learning to Generate Chairs, Tables and Cars with Convolutional Networks

    Authors: Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox

    Abstract: We train generative 'up-convolutional' neural networks which are able to generate images of objects given object style, viewpoint, and color. We train the networks on rendered 3D models of chairs, tables, and cars. Our experiments show that the networks do not merely learn all images by heart, but rather find a meaningful representation of 3D models allowing them to assess the similarity of differ…

    Submitted 2 August, 2017; v1 submitted 21 November, 2014; originally announced November 2014.

    Comments: v4: final PAMI version. New architecture figure

  46. arXiv:1406.6909  [pdf, other

    cs.LG cs.CV cs.NE

    Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks

    Authors: Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, Thomas Brox

    Abstract: Deep convolutional networks have proven to be very successful in learning task specific features that allow for unprecedented performance on various computer vision tasks. Training of such networks follows mostly the supervised learning paradigm, where sufficiently many input-output pairs are required for training. Acquisition of large training sets is one of the key challenges, when approaching a…

    Submitted 19 June, 2015; v1 submitted 26 June, 2014; originally announced June 2014.

    Comments: PAMI submission. Includes matching experiments as in arXiv:1405.5769v1. Also includes new network architectures, experiments on Caltech-256, experiment on combining Exemplar-CNN with clustering

  47. arXiv:1405.5769

    cs.CV cs.LG

    Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT

    Authors: Philipp Fischer, Alexey Dosovitskiy, Thomas Brox

    Abstract: Latest results indicate that features learned via convolutional neural networks outperform previous descriptors on classification tasks by a large margin. It has been shown that these networks still work well when they are applied to datasets or recognition tasks different from those they were trained on. However, descriptors like SIFT are not only used in recognition but also for many corresponde…

    Submitted 24 June, 2015; v1 submitted 22 May, 2014; originally announced May 2014.

    Comments: This paper has been merged with arXiv:1406.6909

    ACM Class: I.2.6; I.4.7; I.4.8

  48. arXiv:1312.5242  [pdf, other

    cs.CV cs.LG cs.NE

    Unsupervised feature learning by augmenting single images

    Authors: Alexey Dosovitskiy, Jost Tobias Springenberg, Thomas Brox

    Abstract: When deep learning is applied to visual object recognition, data augmentation is often used to generate additional training data without extra labeling cost. It helps to reduce overfitting and increase the performance of the algorithm. In this paper we investigate if it is possible to use data augmentation as the main component of an unsupervised feature learning architecture. To that end we sampl…

    Submitted 16 February, 2014; v1 submitted 18 December, 2013; originally announced December 2013.

    Comments: ICLR 2014 workshop track submission (7 pages, 4 figures, 1 table)