Showing 1–18 of 18 results for author: Lezama, J

  1. arXiv:2405.13762

    cs.CV cs.LG cs.MM cs.SD eess.AS

    A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

    Authors: Gwanghyun Kim, Alonso Martinez, Yu-Chuan Su, Brendan Jou, José Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Krishna Somandepalli

    Abstract: Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the a…

    Submitted 22 May, 2024; originally announced May 2024.

  2. arXiv:2405.13195

    cs.CV cs.AI

    CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

    Authors: Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa

    Abstract: We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling the output of such models. We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimens…

    Submitted 21 May, 2024; originally announced May 2024.

  3. arXiv:2312.14125

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  4. arXiv:2312.06662

    cs.CV cs.AI cs.LG

    Photorealistic Video Generation with Diffusion Models

    Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

    Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project website https://walt-video-diffusion.github.io/

  5. arXiv:2310.05737

    cs.CV cs.AI cs.MM

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

    Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer…

    Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  6. arXiv:2308.02947

    cs.CV

    Blind Motion Deblurring with Pixel-Wise Kernel Estimation via Kernel Prediction Networks

    Authors: Guillermo Carbajal, Patricia Vitoria, José Lezama, Pablo Musé

    Abstract: In recent years, the removal of motion blur in photographs has seen impressive progress in the hands of deep learning-based methods, trained to map directly from blurry to sharp images. For this reason, approaches that explicitly use a forward degradation model received significantly less attention. However, a well-defined specification of the blur genesis, as an intermediate step, promotes the ge…

    Submitted 5 August, 2023; originally announced August 2023.

  7. arXiv:2302.05496

    cs.CV cs.AI

    MaskSketch: Unpaired Structure-guided Masked Image Generation

    Authors: Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

    Abstract: Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding…

    Submitted 10 February, 2023; originally announced February 2023.

  8. arXiv:2301.00704

    cs.CV cs.AI cs.LG

    Muse: Text-To-Image Generation via Masked Generative Transformers

    Authors: Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

    Abstract: We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. C…

    Submitted 2 January, 2023; originally announced January 2023.

  9. arXiv:2212.13459

    cs.CV eess.IV

    Scaling Painting Style Transfer

    Authors: Bruno Galerne, Lara Raad, José Lezama, Jean-Michel Morel

    Abstract: Neural style transfer (NST) is a deep learning technique that produces an unprecedentedly rich style transfer from a style image to a content image. It is particularly impressive when it comes to transferring style from a painting to an image. NST was originally achieved by solving an optimization problem to match the global statistics of the style image while preserving the local geometric featur…

    Submitted 26 June, 2024; v1 submitted 27 December, 2022; originally announced December 2022.

    Comments: 14 pages, 9 figures, 4 tables, accepted at EGSR 2024

  10. arXiv:2212.05199

    cs.CV

    MAGVIT: Masked Generative Video Transformer

    Authors: Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

    Abstract: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MA…

    Submitted 4 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: CVPR 2023 highlight

  11. arXiv:2210.00990

    cs.CV cs.AI

    Visual Prompt Tuning for Generative Transfer Learning

    Authors: Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

    Abstract: Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: technical report

  12. arXiv:2209.12675

    cs.CV

    Rethinking Motion Deblurring Training: A Segmentation-Based Method for Simulating Non-Uniform Motion Blurred Images

    Authors: Guillermo Carbajal, Patricia Vitoria, Pablo Musé, José Lezama

    Abstract: Successful training of end-to-end deep networks for real motion deblurring requires datasets of sharp/blurred image pairs that are realistic and diverse enough to achieve generalization to real blurred images. Obtaining such datasets remains a challenging task. In this paper, we first review the limitations of existing deblurring benchmark datasets from the perspective of generalization to blurry…

    Submitted 26 September, 2022; originally announced September 2022.

  13. arXiv:2209.04439

    cs.CV

    Improved Masked Image Generation with Token-Critic

    Authors: José Lezama, Huiwen Chang, Lu Jiang, Irfan Essa

    Abstract: Non-autoregressive generative transformers recently demonstrated impressive image generation performance, and orders of magnitude faster sampling than their autoregressive counterparts. However, optimal parallel sampling from the true joint distribution of visual tokens remains an open challenge. In this paper we introduce Token-Critic, an auxiliary model to guide the sampling of a non-autoregress…

    Submitted 9 September, 2022; originally announced September 2022.

    Comments: Accepted to ECCV 2022

  14. arXiv:2102.01026

    cs.CV cs.AI cs.LG

    Non-uniform Blur Kernel Estimation via Adaptive Basis Decomposition

    Authors: Guillermo Carbajal, Patricia Vitoria, Mauricio Delbracio, Pablo Musé, José Lezama

    Abstract: Motion blur estimation remains an important task for scene analysis and image restoration. In recent years, the removal of motion blur in photographs has seen impressive progress in the hands of deep learning-based methods, trained to map directly from blurry to sharp images. Characterization of the motion blur, on the other hand, has received less attention, and progress in model-based methods fo…

    Submitted 26 April, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

  15. Psychophysics, Gestalts and Games

    Authors: José Lezama, Samy Blusseau, Jean-Michel Morel, Gregory Randall, Rafael Grompone von Gioi

    Abstract: Many psychophysical studies are dedicated to the evaluation of the human gestalt detection on dot or Gabor patterns, and to model its dependence on the pattern and background parameters. Nevertheless, even for these constrained percepts, psychophysics have not yet reached the challenging prediction stage, where human detection would be quantitatively predicted by a (generic) model. On the other ha…

    Submitted 25 May, 2018; originally announced May 2018.

    Journal ref: Giovanna Citti, Alessandro Sarti. Neuromathematics of Vision, Springer Berlin Heidelberg, pp.217-242, 2014, Lecture Notes in Morphogenesis

  16. arXiv:1712.01727

    cs.CV cs.LG stat.ML

    OLÉ: Orthogonal Low-rank Embedding, A Plug and Play Geometric Loss for Deep Learning

    Authors: José Lezama, Qiang Qiu, Pablo Musé, Guillermo Sapiro

    Abstract: Deep neural networks trained using a softmax layer at the top and the cross-entropy loss are ubiquitous tools for image classification. Yet, this does not naturally enforce intra-class similarity nor inter-class margin of the learned deep representations. To simultaneously achieve these two goals, different solutions have been proposed in the literature, such as the pairwise or triplet losses. How…

    Submitted 5 December, 2017; originally announced December 2017.

  17. arXiv:1711.08364

    cs.CV stat.ML

    ForestHash: Semantic Hashing With Shallow Random Forests and Tiny Convolutional Networks

    Authors: Qiang Qiu, Jose Lezama, Alex Bronstein, Guillermo Sapiro

    Abstract: Hash codes are efficient data representations for coping with the ever growing amounts of data. In this paper, we introduce a random forest semantic hashing scheme that embeds tiny convolutional neural networks (CNN) into shallow random forests, with near-optimal information-theoretic code aggregation among trees. We start with a simple hashing scheme, where random trees in a forest act as hashing…

    Submitted 27 July, 2018; v1 submitted 22 November, 2017; originally announced November 2017.

    Comments: Accepted to ECCV 2018

  18. arXiv:1611.06638

    cs.CV

    Not Afraid of the Dark: NIR-VIS Face Recognition via Cross-spectral Hallucination and Low-rank Embedding

    Authors: Jose Lezama, Qiang Qiu, Guillermo Sapiro

    Abstract: Surveillance cameras today often capture NIR (near infrared) images in low-light environments. However, most face datasets accessible for training and verification are only collected in the VIS (visible light) spectrum. It remains a challenging problem to match NIR to VIS face images due to the different light spectrum. Recently, breakthroughs have been made for VIS face recognition by applying de…

    Submitted 20 November, 2016; originally announced November 2016.