Skip to main content

Showing 1–50 of 87 results for author: Tulyakov, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.19388  [pdf, other

    cs.SD cs.CL cs.CV cs.MM eess.AS

    Taming Data and Transformers for Audio Generation

    Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, Vicente Ordonez

    Abstract: Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Project Webpage: https://snap-research.github.io/GenAU/

  2. arXiv:2406.12831  [pdf, other

    cs.CV cs.AI cs.MM

    VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

    Authors: **g Gu, Yuwei Fang, Ivan Skorokhodov, Peter Wonka, Xinya Du, Sergey Tulyakov, Xin Eric Wang

    Abstract: Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistency edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemp… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 13 pages, 11 figures

  3. arXiv:2406.07792  [pdf, other

    cs.CV

    Hierarchical Patch Diffusion Models for High-Resolution Video Generation

    Authors: Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov

    Abstract: Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. This makes it very efficient during training and unlocks end-to-end optimization on high-resolutio… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: CVPR 2024

  4. arXiv:2406.07472  [pdf, other

    cs.CV

    4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

    Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee

    Abstract: Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependen… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

  5. arXiv:2406.05649  [pdf, other

    cs.CV cs.AI

    GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

    Authors: Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee

    Abstract: We propose a novel approach for 3D mesh reconstruction from multi-view images. Our method takes inspiration from large reconstruction models like LRM that use a transformer-based triplane generator and a Neural Radiance Field (NeRF) model trained on multi-view images. However, in our method, we introduce several important modifications that allow us to significantly enhance 3D reconstruction quali… ▽ More

    Submitted 13 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

    Comments: 19 pages, 17 figures. Project page: https://snap-research.github.io/GTR/

  6. arXiv:2406.04333  [pdf, other

    cs.CV

    BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

    Authors: Yang Sui, Yanyu Li, Anil Kag, Yerlan Idelbayev, Junli Cao, Ju Hu, Dhritiman Sagar, Bo Yuan, Sergey Tulyakov, Jian Ren

    Abstract: Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this wor… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://snap-research.github.io/BitsFusion

  7. arXiv:2406.04324  [pdf, other

    cs.CV eess.IV

    SF-V: Single Forward Video Generation Model

    Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren

    Abstract: Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune p… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://snap-research.github.io/SF-V

  8. arXiv:2404.11565  [pdf, other

    cs.CV cs.AI cs.GR

    MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

    Authors: Kuan-Chieh Wang, Daniil Ostashev, Yuwei Fang, Sergey Tulyakov, Kfir Aberman

    Abstract: We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixi… ▽ More

    Submitted 6 May, 2024; v1 submitted 17 April, 2024; originally announced April 2024.

    Comments: Project Website: https://snap-research.github.io/mixture-of-attention, Same as previous version, only updated metadata because bib was missing an author name

  9. arXiv:2403.18978  [pdf, other

    cs.CV cs.AI cs.LG

    TextCraftor: Your Text Encoder Can be Image Quality Controller

    Authors: Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren

    Abstract: Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs w… ▽ More

    Submitted 27 March, 2024; originally announced March 2024.

  10. arXiv:2403.17920  [pdf, other

    cs.CV

    TC4D: Trajectory-Conditioned Text-to-4D Generation

    Authors: Sherwin Bahmani, Xian Liu, Yifan Wang, Ivan Skorokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, Andrea Tagliasacchi, David B. Lindell

    Abstract: Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate-they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The la… ▽ More

    Submitted 10 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Project Page: https://sherwinbahmani.github.io/tc4d

  11. arXiv:2403.14599  [pdf, other

    cs.CV

    MyVLM: Personalizing VLMs for User-Specific Queries

    Authors: Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or

    Abstract: Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether… ▽ More

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Project page: https://snap-research.github.io/MyVLM/

  12. arXiv:2402.19479  [pdf, other

    cs.CV

    Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

    Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov

    Abstract: The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, a… ▽ More

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: CVPR 2024. Project Page: https://snap-research.github.io/Panda-70M

  13. arXiv:2402.17753  [pdf, other

    cs.CL cs.AI cs.LG

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang

    Abstract: Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to gen… ▽ More

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: 19 pages; Project page: https://snap-research.github.io/locomo/

  14. arXiv:2402.14797  [pdf, other

    cs.CV cs.AI

    Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

    Authors: Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, Sergey Tulyakov

    Abstract: Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a… ▽ More

    Submitted 22 February, 2024; originally announced February 2024.

  15. arXiv:2402.11487  [pdf, other

    cs.CV

    Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

    Authors: Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

    Abstract: Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with… ▽ More

    Submitted 18 February, 2024; originally announced February 2024.

    Comments: 9 Figures, 8 Pages

  16. arXiv:2402.05235  [pdf, other

    cs.CV

    SPAD : Spatially Aware Multiview Diffusers

    Authors: Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, Aliaksandr Siarohin

    Abstract: We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVD… ▽ More

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: Webpage: https://yashkant.github.io/spad

  17. arXiv:2402.00867  [pdf, other

    cs.CV

    AToM: Amortized Text-to-Mesh using 2D Diffusion

    Authors: Guocheng Qian, Junli Cao, Aliaksandr Siarohin, Yash Kant, Chaoyang Wang, Michael Vasilkovsky, Hsin-Ying Lee, Yuwei Fang, Ivan Skorokhodov, Peiye Zhuang, Igor Gilitschenski, Jian Ren, Bernard Ghanem, Kfir Aberman, Sergey Tulyakov

    Abstract: We introduce Amortized Text-to-Mesh (AToM), a feed-forward text-to-mesh framework optimized across multiple text prompts simultaneously. In contrast to existing text-to-3D methods that often entail time-consuming per-prompt optimization and commonly output representations other than polygonal meshes, AToM directly generates high-quality textured meshes in less than 1 second with around 10 times re… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 19 pages with appendix and references. Webpage: https://snap-research.github.io/AToM/

  18. arXiv:2401.06127  [pdf, other

    cs.CV cs.AI cs.LG

    E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

    Authors: Yifan Gong, Zheng Zhan, Qing **, Yanyu Li, Yerlan Idelbayev, Xian Liu, Andrey Zharkov, Kfir Aberman, Sergey Tulyakov, Yanzhi Wang, Jian Ren

    Abstract: One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models to generate paired datasets used for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with… ▽ More

    Submitted 2 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: ICML 2024. Project Page: https://yifanfanfanfan.github.io/e2gan/

  19. arXiv:2401.05583  [pdf, other

    cs.CV

    Diffusion Priors for Dynamic View Synthesis from Monocular Videos

    Authors: Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, Sergey Tulyakov

    Abstract: Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos. Existing methods struggle to distinguishing between motion and structure, particularly in scenarios where camera poses are either unknown or constrained compared to object motion. Furthermore, with information solely from reference images, it is extremely challenging to hallucinate unseen regions t… ▽ More

    Submitted 10 January, 2024; originally announced January 2024.

  20. arXiv:2312.14154  [pdf, other

    cs.CV

    Virtual Pets: Animatable Animal Generation in 3D Scenes

    Authors: Yen-Chi Cheng, Chieh Hubert Lin, Chaoyang Wang, Yash Kant, Sergey Tulyakov, Alexander Schwing, Liangyan Gui, Hsin-Ying Lee

    Abstract: Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the fo… ▽ More

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Preprint. Project page: https://yccyenchicheng.github.io/VirtualPets/

  21. arXiv:2312.08885  [pdf, other

    cs.CV

    SceneWiz3D: Towards Text-guided 3D Scene Composition

    Authors: Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, Hsin-Ying Lee

    Abstract: We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets. Generating entire scenes, however, remains very challenging as a scene contains multiple 3D objects, diverse and scattered. In this work, we introduce Scen… ▽ More

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Project page: https://zqh0253.github.io/SceneWiz3D/

  22. arXiv:2312.06661  [pdf, other

    cs.CV

    UpFusion: Novel View Diffusion from Unposed Sparse View Observations

    Authors: Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani

    Abstract: We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from input views, but are not robust in-the-wild when such information is unavailable/inaccurate. I… ▽ More

    Submitted 4 January, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Project Page: https://upfusion3d.github.io/ v2: Fixed a citation mistake

  23. arXiv:2311.17984  [pdf, other

    cs.CV

    4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

    Authors: Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, David B. Lindell

    Abstract: Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video models to generate dynamic 3D scenes. However, current text-to-4D methods face a three-way tradeoff between the quality of scene appearance, 3D structure, and motion. For example, text-to-image models and their 3D-aware variants are trained on internet-scale image datasets and can be used to produce s… ▽ More

    Submitted 26 May, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: CVPR 2024; Project page: https://sherwinbahmani.github.io/4dfy

  24. arXiv:2311.17261  [pdf, other

    cs.CV

    SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

    Authors: Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner

    Abstract: We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distillate diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization proble… ▽ More

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: Project website: https://daveredrum.github.io/SceneTex/

  25. arXiv:2310.16832  [pdf, other

    cs.CV

    LightSpeed: Light and Fast Neural Light Fields on Mobile Devices

    Authors: Aarush Gupta, Junli Cao, Chaoyang Wang, Ju Hu, Sergey Tulyakov, Jian Ren, László A Jeni

    Abstract: Real-time novel-view image synthesis on mobile devices is prohibitive due to the limited computational power and storage. Using volumetric rendering methods, such as NeRF and its derivatives, on mobile devices is not suitable due to the high computational cost of volumetric rendering. On the other hand, recent advances in neural light field representations have shown promising real-time view synth… ▽ More

    Submitted 26 October, 2023; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: Project Page: http://lightspeed-r2l.github.io/ . Add camera ready version

    Journal ref: NeurIPS 2023

  26. arXiv:2310.16167  [pdf, other

    cs.CV

    iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis

    Authors: Yash Kant, Aliaksandr Siarohin, Michael Vasilkovsky, Riza Alp Guler, Jian Ren, Sergey Tulyakov, Igor Gilitschenski

    Abstract: We present a method for generating consistent novel views from a single source image. Our approach focuses on maximizing the reuse of visible pixels from the source image. To achieve this, we use a monocular depth estimator that transfers visible pixels from the source view to the target view. Starting from a pre-trained 2D inpainting diffusion model, we train our method on the large-scale Objaver… ▽ More

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted to SIGGRAPH Asia, 2023 (Conference Papers)

  27. arXiv:2310.08579  [pdf, other

    cs.CV

    HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion

    Authors: Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov

    Abstract: Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL-E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human image is inherently structural over multiple granularities, from… ▽ More

    Submitted 14 March, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: Accepted by ICLR 2024, camera-ready version. Project Page: https://snap-research.github.io/HyperHuman/

  28. arXiv:2307.05445  [pdf, other

    cs.CV

    AutoDecoding Latent 3D Diffusion Models

    Authors: Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov

    Abstract: We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent s… ▽ More

    Submitted 7 July, 2023; originally announced July 2023.

    Comments: Project page: https://snap-research.github.io/3DVADER/

  29. arXiv:2307.03190  [pdf, other

    cs.CV cs.GR cs.LG

    Text-Guided Synthesis of Eulerian Cinemagraphs

    Authors: Aniruddha Mahapatra, Aliaksandr Siarohin, Hsin-Ying Lee, Sergey Tulyakov, Jun-Yan Zhu

    Abstract: We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions - an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers, and drifting clouds, which exhibit continuous motion… ▽ More

    Submitted 25 September, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Project website: https://text2cinemagraph.github.io/website/

  30. arXiv:2306.17843  [pdf, other

    cs.CV

    Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

    Authors: Guocheng Qian, **jie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, Bernard Ghanem

    Abstract: We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes generation from a single unposed image in the wild using both2D and 3D priors. In the first stage, we optimize a neural radiance field to produce a coarse geometry. In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing… ▽ More

    Submitted 23 July, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

    Comments: webpage: https://guochengqian.github.io/project/magic123/

  31. arXiv:2306.00980  [pdf, other

    cs.CV cs.AI cs.LG

    SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

    Authors: Yanyu Li, Huan Wang, Qing **, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren

    Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion mode… ▽ More

    Submitted 16 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Our project webpage: https://snap-research.github.io/SnapFusion/

  32. arXiv:2303.13472  [pdf, other

    cs.CV cs.AI

    Promptable Game Models: Text-Guided Game Simulation via Masked Diffusion Models

    Authors: Willi Menapace, Aliaksandr Siarohin, Stéphane Lathuilière, Panos Achlioptas, Vladislav Golyanik, Sergey Tulyakov, Elisa Ricci

    Abstract: Neural video game simulators emerged as powerful tools to generate and edit videos. Their idea is to represent games as the evolution of an environment's state driven by the actions of its agents. While such a paradigm enables users to play a game action-by-action, its rigidity precludes more semantic forms of control. To overcome this limitation, we augment game models with prompts specified as a… ▽ More

    Submitted 21 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ACM Transactions on Graphics \c{opyright} Copyright is held by the owner/author(s) 2023. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, http://dx.doi.org/10.1145/3635705

  33. arXiv:2303.11396  [pdf, other

    cs.CV

    Text2Tex: Text-driven Texture Synthesis via Diffusion Models

    Authors: Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner

    Abstract: We present Text2Tex, a novel method for generating high-quality textures for 3D meshes from the given text prompts. Our method incorporates inpainting into a pre-trained depth-aware image diffusion model to progressively synthesize high resolution partial textures from multiple viewpoints. To avoid accumulating inconsistent and stretched artifacts across views, we dynamically segment the rendered… ▽ More

    Submitted 20 March, 2023; originally announced March 2023.

    Comments: Project page: https://daveredrum.github.io/Text2Tex/

  34. arXiv:2303.01416  [pdf, other

    cs.CV cs.AI cs.GR

    3D generation on ImageNet

    Authors: Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, Sergey Tulyakov

    Abstract: Existing 3D-from-2D generators are typically designed for well-curated single-category datasets, where all the objects have (approximately) the same scale, 3D location, and orientation, and the camera always points to the center of the scene. This makes them inapplicable to diverse, in-the-wild datasets of non-alignable scenes rendered from arbitrary camera poses. In this work, we develop a 3D gen… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: ICLR 2023 (Oral)

    Journal ref: ICLR 2023

  35. arXiv:2302.09227  [pdf, other

    cs.CV cs.GR

    Invertible Neural Skinning

    Authors: Yash Kant, Aliaksandr Siarohin, Riza Alp Guler, Menglei Chai, Jian Ren, Sergey Tulyakov, Igor Gilitschenski

    Abstract: Building animatable and editable models of clothed humans from raw 3D scans and poses is a challenging problem. Existing reposing methods suffer from the limited expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across different poses. In this work, we introduce Invertible Neural Skinning (… ▽ More

    Submitted 4 March, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

  36. arXiv:2301.11326  [pdf, other

    cs.CV

    Unsupervised Volumetric Animation

    Authors: Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Kyle Olszewski, Jian Ren, Hsin-Ying Lee, Menglei Chai, Sergey Tulyakov

    Abstract: We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, our model learns th… ▽ More

    Submitted 26 January, 2023; originally announced January 2023.

  37. arXiv:2301.09637  [pdf, other

    cs.CV cs.AI cs.GR cs.LG

    InfiniCity: Infinite-Scale City Synthesis

    Authors: Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov

    Abstract: Toward infinite-scale 3D city synthesis, we propose a novel framework, InfiniCity, which constructs and renders an unconstrainedly large and 3D-grounded environment from random noises. InfiniCity decomposes the seemingly impractical task into three feasible modules, taking advantage of both 2D and 3D data. First, an infinite-pixel image synthesis module generates arbitrary-scale 2D maps from the b… ▽ More

    Submitted 14 August, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

  38. arXiv:2301.02700  [pdf, other

    cs.CV cs.GR

    3DAvatarGAN: Bridging Domains for Personalized Editable Avatars

    Authors: Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin, Peter Wonka, Sergey Tulyakov

    Abstract: Modern 3D-GANs synthesize geometry and texture by training on large-scale datasets with a consistent structure. Training such models on stylized, artistic data, with often unknown, highly variable geometry, and camera information has not yet been shown possible. Can we train a 3D GAN on such artistic data, while maintaining multi-view consistency and texture quality? To this end, we propose an ada… ▽ More

    Submitted 26 March, 2023; v1 submitted 6 January, 2023; originally announced January 2023.

    Comments: Project Page: https://rameenabdal.github.io/3DAvatarGAN/

  39. arXiv:2212.11984  [pdf, other

    cs.CV

    DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis

    Authors: Yinghao Xu, Menglei Chai, Zifan Shi, Sida Peng, Ivan Skorokhodov, Aliaksandr Siarohin, Ceyuan Yang, Yujun Shen, Hsin-Ying Lee, Bolei Zhou, Sergey Tulyakov

    Abstract: Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3Daware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes… ▽ More

    Submitted 22 December, 2022; originally announced December 2022.

    Comments: Project page: https://snap-research.github.io/discoscene/

  40. arXiv:2212.08059  [pdf, other

    cs.CV cs.AI cs.LG

    Rethinking Vision Transformers for MobileNet Size and Speed

    Authors: Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren

    Abstract: With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its varian… ▽ More

    Submitted 4 September, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: Code is available at: https://github.com/snap-research/EfficientFormer

  41. arXiv:2212.08057  [pdf, other

    cs.CV cs.AI cs.LG

    Real-Time Neural Light Field on Mobile Devices

    Authors: Junli Cao, Huan Wang, Pavlo Chemerys, Vladislav Shakhrai, Ju Hu, Yun Fu, Denys Makoviichuk, Sergey Tulyakov, Jian Ren

    Abstract: Recent efforts in Neural Rendering Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been condu… ▽ More

    Submitted 24 June, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

    Comments: CVPR 2023. Project page: https://snap-research.github.io/MobileR2L/ Code: https://github.com/snap-research/MobileR2L/

  42. arXiv:2212.05011  [pdf, other

    cs.CV cs.CL

    LADIS: Language Disentanglement for 3D Shape Editing

    Authors: Ian Huang, Panos Achlioptas, Tianyi Zhang, Sergey Tulyakov, Minhyuk Sung, Leonidas Guibas

    Abstract: Natural language interaction is a promising direction for democratizing 3D shape design. However, existing methods for text-driven 3D shape editing face challenges in producing decoupled, local edits to 3D shapes. We address this problem by learning disentangled latent representations that ground language in 3D geometry. To this end, we propose a complementary tool set including a novel network ar… ▽ More

    Submitted 9 December, 2022; originally announced December 2022.

  43. arXiv:2212.04493  [pdf, other

    cs.CV cs.LG

    SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

    Authors: Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander Schwing, Liangyan Gui

    Abstract: In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes and combinations of these, further allowing to adjust the strength of each input. At the core of our approach is an encoder-de… ▽ More

    Submitted 21 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: In CVPR 2023. Project page and code is available at: https://yccyenchicheng.github.io/SDFusion/. Fix some typos

  44. arXiv:2211.13319  [pdf, other

    cs.CV

    Make-A-Story: Visual Memory Conditioned Consistent Story Generation

    Authors: Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

    Abstract: There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and main actors in them. Therefore employing such models for more complex task of story visualization, where naturally references and co-referen… ▽ More

    Submitted 5 May, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 11 pages

  45. arXiv:2210.01946  [pdf, other

    cs.CV cs.CL

    Affection: Learning Affective Explanations for Real-World Visual Data

    Authors: Panos Achlioptas, Maks Ovsjanikov, Leonidas Guibas, Sergey Tulyakov

    Abstract: In this work, we explore the emotional reactions that real-world images tend to induce by using natural language as the medium to express the rationale behind an affective response to a given visual stimulus. To embark on this journey, we introduce and share with the research community a large-scale dataset that contains emotional reactions and free-form textual explanations for 85,007 publicly av… ▽ More

    Submitted 4 October, 2022; originally announced October 2022.

    Comments: https://affective-explanations.org

  46. arXiv:2209.11204  [pdf, other

    cs.LG cs.AI cs.CV

    Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training

    Authors: Geng Yuan, Yanyu Li, Sheng Li, Zhenglun Kong, Sergey Tulyakov, Xulong Tang, Yanzhi Wang, Jian Ren

    Abstract: Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. The current research mainly devotes efforts to reducing training costs by further increasing model sparsity. However, increasing sparsity is not always ideal since it will inevitably introduce severe accuracy degradation at an extremely high sparsity level. This paper intends to explore other… ▽ More

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Published in 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

  47. arXiv:2208.11398  [pdf, other

    cs.CV

    Event-based Image Deblurring with Dynamic Motion Awareness

    Authors: Patricia Vitoria, Stamatios Georgoulis, Stepan Tulyakov, Alfredo Bochicchio, Julius Erbach, Yuanyou Li

    Abstract: Non-uniform image deblurring is a challenging task due to the lack of temporal and textural information in the blurry image itself. Complementary information from auxiliary sensors such event sensors are being explored to address these limitations. The latter can record changes in a logarithmic intensity asynchronously, called events, with high temporal resolution and high dynamic range. Current e… ▽ More

    Submitted 24 August, 2022; originally announced August 2022.

    Journal ref: ECCVW 2022

  48. arXiv:2207.11795  [pdf, other

    cs.CV

    Cross-Modal 3D Shape Generation and Manipulation

    Authors: Zezhou Cheng, Menglei Chai, Jian Ren, Hsin-Ying Lee, Kyle Olszewski, Zeng Huang, Subhransu Maji, Sergey Tulyakov

    Abstract: Creating and editing the shape and color of 3D objects require tremendous human effort and expertise. Compared to direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually much more natural and intuitive for the users. In this paper, we propose a generic multi-modal generative model that couples the 2D modalities and implicit 3D representations through shared… ▽ More

    Submitted 24 July, 2022; originally announced July 2022.

    Comments: ECCV 2022. Project page: https://people.cs.umass.edu/~zezhoucheng/edit3d/

  49. arXiv:2206.10535  [pdf, other

    cs.CV cs.AI cs.LG

    EpiGRAF: Rethinking training of 3D GANs

    Authors: Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, Peter Wonka

    Abstract: A very recent trend in generative modeling is building 3D-aware generators from 2D image collections. To induce the 3D bias, such models typically rely on volumetric rendering, which is expensive to employ at high resolutions. During the past months, there appeared more than 10 works that address this scaling issue by training a separate 2D decoder to upsample a low-resolution image (or a feature… ▽ More

    Submitted 15 December, 2022; v1 submitted 21 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022

  50. arXiv:2206.07771  [pdf, other

    cs.CV

    Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

    Authors: Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan

    Abstract: Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variation… ▽ More

    Submitted 16 February, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: ICLR 2023. Project at https://github.com/L-YeZhu/CDCD