Showing 1–23 of 23 results for author: Rombach, R

  1. arXiv:2403.12015 [pdf, other]

    cs.CV

    Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

    Authors: Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, Robin Rombach

    Abstract: Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD) aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We i…

    Submitted 18 March, 2024; originally announced March 2024.

  2. arXiv:2403.12008 [pdf, other]

    cs.CV

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Authors: Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani

    Abstract: We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation propose techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affec…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://sv3d.github.io/

  3. arXiv:2403.03206 [pdf, other]

    cs.CV

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach

    Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is n…

    Submitted 5 March, 2024; originally announced March 2024.

  4. arXiv:2401.01808 [pdf, other]

    cs.CV

    aMUSEd: An Open MUSE Reproduction

    Authors: Suraj Patil, William Berman, Robin Rombach, Patrick von Platen

    Abstract: We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpre…

    Submitted 3 January, 2024; originally announced January 2024.

  5. arXiv:2312.03606 [pdf, other]

    cs.CV cs.AI cs.LG

    DiffusionSat: A Generative Foundation Model for Satellite Imagery

    Authors: Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon

    Abstract: Diffusion models have achieved state-of-the-art results on many modalities including images, speech, and video. However, existing models are not tailored to support remote sensing data, which is widely used in important applications including environmental monitoring and crop-yield prediction. Satellite images are significantly different from natural images -- they can be multi-spectral, irregular…

    Submitted 25 May, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: Published at ICLR 2024

  6. arXiv:2311.17042 [pdf, other]

    cs.CV

    Adversarial Diffusion Distillation

    Authors: Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach

    Abstract: We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the l…

    Submitted 28 November, 2023; originally announced November 2023.

  7. arXiv:2311.15127 [pdf, other]

    cs.CV

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

    Abstract: We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary wi…

    Submitted 25 November, 2023; originally announced November 2023.

  8. arXiv:2307.01952 [pdf, other]

    cs.CV cs.AI

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

    Abstract: We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ra…

    Submitted 4 July, 2023; originally announced July 2023.

  9. arXiv:2304.09787 [pdf, other]

    cs.CV

    NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

    Authors: Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler

    Abstract: Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first trai…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  10. arXiv:2304.08818 [pdf, other]

    cs.CV cs.LG

    Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

    Authors: Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

    Abstract: Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by int…

    Submitted 27 December, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2023. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

  11. arXiv:2210.03142 [pdf, other]

    cs.CV cs.AI cs.LG

    On Distillation of Guided Diffusion Models

    Authors: Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans

    Abstract: Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALLE-2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two di…

    Submitted 12 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: CVPR 2023, Award candidate

  12. arXiv:2207.13038 [pdf, other]

    cs.CV

    Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models

    Authors: Robin Rombach, Andreas Blattmann, Björn Ommer

    Abstract: Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Of particular note is the field of "AI-Art", which has seen unprecedented growth with the emergence of powerful multimodal models such as CLIP. By combining speech and image synthesis models, so-called "prompt-engineering" has become established, in which carefully select…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: 4 pages

  13. arXiv:2204.11824 [pdf, other]

    cs.CV

    Semi-Parametric Neural Image Synthesis

    Authors: Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, Björn Ommer

    Abstract: Novel architectures have recently improved generative image synthesis leading to excellent visual quality in various tasks. Much of this success is due to the scalability of these architectures and hence caused by a dramatic increase in model complexity and in the computational resources invested in training these models. Our work questions the underlying paradigm of compressing large training dat…

    Submitted 24 October, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022

  14. arXiv:2112.10752 [pdf, other]

    cs.CV

    High-Resolution Image Synthesis with Latent Diffusion Models

    Authors: Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

    Abstract: By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization o…

    Submitted 13 April, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: CVPR 2022

  15. arXiv:2108.08827 [pdf, other]

    cs.CV

    ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

    Authors: Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer

    Abstract: Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. Not only is this unidirectional, sequential bias of attention unnatural for images as it di…

    Submitted 19 August, 2021; originally announced August 2021.

  16. arXiv:2105.06458 [pdf, other]

    cs.CV

    High-Resolution Complex Scene Synthesis with Transformers

    Authors: Manuel Jahn, Robin Rombach, Björn Ommer

    Abstract: The use of coarse-grained layouts for controllable synthesis of complex scene images via deep generative models has recently gained popularity. However, results of current approaches still fall short of their promise of high-resolution synthesis. We hypothesize that this is mostly due to the highly engineered nature of these approaches which often rely on auxiliary losses and intermediate steps su…

    Submitted 13 May, 2021; originally announced May 2021.

    Comments: AI for Content Creation Workshop, CVPR 2021

  17. arXiv:2105.04551 [pdf, other]

    cs.CV

    Stochastic Image-to-Video Synthesis using cINNs

    Authors: Michael Dorkenwald, Timo Milbich, Andreas Blattmann, Robin Rombach, Konstantinos G. Derpanis, Björn Ommer

    Abstract: Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bij…

    Submitted 17 June, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

    Comments: Accepted to CVPR 2021

  18. arXiv:2104.07652 [pdf, other]

    cs.CV

    Geometry-Free View Synthesis: Transformers and no 3D Priors

    Authors: Robin Rombach, Patrick Esser, Björn Ommer

    Abstract: Is a geometric model required to synthesize novel views from a single image? Being bound to local convolutions, CNNs need explicit 3D biases to model geometric transformations. In contrast, we demonstrate that a transformer-based model can synthesize entirely novel views without any hand-engineered 3D biases. This is achieved by (i) a global attention mechanism for implicitly learning long-range 3…

    Submitted 30 August, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: Published at ICCV 2021. Code available at https://git.io/JOnwn

  19. arXiv:2012.09841 [pdf, other]

    cs.CV

    Taming Transformers for High-Resolution Image Synthesis

    Authors: Patrick Esser, Robin Rombach, Björn Ommer

    Abstract: Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of…

    Submitted 23 June, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: Changelog can be found in the supplementary

  20. arXiv:2012.02516 [pdf, other]

    cs.CV cs.LG

    A Note on Data Biases in Generative Models

    Authors: Patrick Esser, Robin Rombach, Björn Ommer

    Abstract: It is tempting to think that machines are less prone to unfairness and prejudice. However, machine learning approaches compute their outputs based on data. While biases can enter at any stage of the development pipeline, models are particularly receptive to mirror biases of the datasets they are trained on and therefore do not necessarily reflect truths about the world but, primarily, truths about…

    Submitted 4 December, 2020; originally announced December 2020.

    Comments: Extended Abstract for the NeurIPS 2020 Workshop on Machine Learning for Creativity and Design

  21. arXiv:2008.01777 [pdf, other]

    cs.CV

    Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs

    Authors: Robin Rombach, Patrick Esser, Björn Ommer

    Abstract: To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as…

    Submitted 4 August, 2020; originally announced August 2020.

    Comments: ECCV 2020. Project page and code at https://compvis.github.io/invariances/

  22. arXiv:2005.13580 [pdf, other]

    cs.CV cs.LG

    Network-to-Network Translation with Conditional Invertible Neural Networks

    Authors: Robin Rombach, Patrick Esser, Björn Ommer

    Abstract: Given the ever-increasing computational costs of modern machine learning models, we need to find new ways to reuse such expert models and thus tap into the resources that have been invested in their creation. Recent work suggests that the power of these massive models is captured by the representations they learn. Therefore, we seek a model that can relate between different existing representation…

    Submitted 9 November, 2020; v1 submitted 27 May, 2020; originally announced May 2020.

    Comments: NeurIPS 2020 (oral). Code at https://github.com/CompVis/net2net

  23. arXiv:2004.13166 [pdf, other]

    cs.CV

    A Disentangling Invertible Interpretation Network for Explaining Latent Representations

    Authors: Patrick Esser, Robin Rombach, Björn Ommer

    Abstract: Neural networks have greatly boosted performance in computer vision by learning powerful representations of input data. The drawback of end-to-end training for maximal overall performance are black-box models whose hidden representations are lacking interpretability: Since distributed coding is optimal for latent layers to improve their robustness, attributing meaning to parts of a hidden feature…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: CVPR 2020. Project Page at https://compvis.github.io/iin/