Showing 1–50 of 107 results for author: Jampani, V

  1. arXiv:2406.20077  [pdf, other]

    cs.CV

    HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

    Authors: Hieu T. Nguyen, Yiwen Chen, Vikram Voleti, Varun Jampani, Huaizu Jiang

    Abstract: We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise m…

    Submitted 28 June, 2024; originally announced June 2024.

  2. arXiv:2406.17396  [pdf, other]

    cs.CV

    SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing

    Authors: Ruihuang Li, Liyi Chen, Zhengqiang Zhang, Varun Jampani, Vishal M. Patel, Lei Zhang

    Abstract: Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potentials for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow conve…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 16 pages, 13 figures

  3. arXiv:2406.08488  [pdf, other]

    cs.CV cs.AI cs.LG

    ICE-G: Image Conditional Editing of 3D Gaussian Splats

    Authors: Vishnu Jaganathan, Hannah Hanyun Huang, Muhammad Zubair Irshad, Varun Jampani, Amit Raj, Zsolt Kira

    Abstract: Recently many techniques have emerged to create high quality 3D assets and scenes. When it comes to editing of these objects, however, existing approaches are either slow, compromise on quality, or do not provide enough customization. We introduce a novel approach to quickly edit a 3D model from a single reference view. Our technique first segments the edit image, and then matches semantically cor…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR AI4CC Workshop 2024. Project page: https://ice-gaussian.github.io

  4. arXiv:2405.13218  [pdf, other]

    cs.CV

    Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

    Authors: Maciej Kilian, Varun Jampani, Luke Zettlemoyer

    Abstract: Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOP…

    Submitted 24 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

  5. arXiv:2404.10142  [pdf, other]

    cs.HC cs.AI

    Shaping Realities: Enhancing 3D Generative AI with Fabrication Constraints

    Authors: Faraz Faruqi, Yingtao Tian, Vrushank Phadnis, Varun Jampani, Stefanie Mueller

    Abstract: Generative AI tools are becoming more prevalent in 3D modeling, enabling users to manipulate or create new models with text or images as inputs. This makes it easier for users to rapidly customize and iterate on their 3D designs and explore new creative ideas. These methods focus on the aesthetic quality of the 3D models, refining them to look similar to the prompts provided by the user. However,…

    Submitted 16 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  6. arXiv:2404.08636  [pdf, other]

    cs.CV

    Probing the 3D Awareness of Visual Foundation Models

    Authors: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani

    Abstract: Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also repr…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024. Project page: https://github.com/mbanani/probe3d

  7. arXiv:2404.06425  [pdf, other]

    cs.CV

    ZeST: Zero-Shot Material Transfer from a Single Image

    Authors: Ta-Ying Cheng, Prafull Sharma, Andrew Markham, Niki Trigoni, Varun Jampani

    Abstract: We propose ZeST, a method for zero-shot material transfer to an object in the input image given a material exemplar image. ZeST leverages existing diffusion adapters to extract implicit material representation from the exemplar image. This representation is used to transfer the material using pre-trained inpainting diffusion model on the object in the input image using depth estimates as geometry…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Project Page: https://ttchengab.github.io/zest

  8. arXiv:2404.03656  [pdf, other]

    cs.CV

    MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation

    Authors: Hanzhe Hu, Zhizhuo Zhou, Varun Jampani, Shubham Tulsiani

    Abstract: We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models, these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent m…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Project page: https://mvd-fusion.github.io/

  9. arXiv:2404.02125  [pdf, other]

    cs.CV

    3D Congealing: 3D-Aware Image Alignment in the Wild

    Authors: Yunzhi Zhang, Zizhang Li, Amit Raj, Andreas Engelhardt, Yuanzhen Li, Tingbo Hou, Jiajun Wu, Varun Jampani

    Abstract: We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images to a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, po…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Project page: https://ai.stanford.edu/~yzzhang/projects/3d-congealing/

  10. arXiv:2403.17541  [pdf, other]

    cs.CV cs.GR

    WordRobe: Text-Guided Generation of Textured 3D Garments

    Authors: Astitva Srivastava, Pranav Manu, Amit Raj, Varun Jampani, Avinash Sharma

    Abstract: In this paper, we tackle a new and challenging problem of text-driven generation of 3D garments with high-quality textures. We propose "WordRobe", a novel framework for the generation of unposed & textured 3D garment meshes from user-friendly text prompts. We achieve this by first learning a latent representation of 3D garments using a novel coarse-to-fine training strategy and a loss for latent d…

    Submitted 26 March, 2024; originally announced March 2024.

  11. arXiv:2403.12008  [pdf, other]

    cs.CV

    SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

    Authors: Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, Varun Jampani

    Abstract: We present Stable Video 3D (SV3D) -- a latent video diffusion model for high-resolution, image-to-multi-view generation of orbital videos around a 3D object. Recent work on 3D generation propose techniques to adapt 2D generative models for novel view synthesis (NVS) and 3D optimization. However, these methods have several disadvantages due to either limited views or inconsistent NVS, thereby affec…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://sv3d.github.io/

  12. arXiv:2403.02151  [pdf, other]

    cs.CV

    TripoSR: Fast 3D Object Reconstruction from a Single Image

    Authors: Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, Yan-Pei Cao

    Abstract: This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exh…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Model: https://huggingface.co/stabilityai/TripoSR Code: https://github.com/VAST-AI-Research/TripoSR Demo: https://huggingface.co/spaces/stabilityai/TripoSR

  13. arXiv:2401.10171  [pdf, other]

    cs.CV cs.GR

    SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild

    Authors: Andreas Engelhardt, Amit Raj, Mark Boss, Yunzhi Zhang, Abhishek Kar, Yuanzhen Li, Deqing Sun, Ricardo Martin Brualla, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani

    Abstract: We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured with varying lighting, pose, and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape, radiance, and pose. We show that an implicit…

    Submitted 29 March, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024). Updated supplementary material and acknowledgements

  14. arXiv:2401.03108  [pdf, other]

    cs.CV

    Dress-Me-Up: A Dataset & Method for Self-Supervised 3D Garment Retargeting

    Authors: Shanthika Naik, Kunwar Singh, Astitva Srivastava, Dhawal Sirikonda, Amit Raj, Varun Jampani, Avinash Sharma

    Abstract: We propose a novel self-supervised framework for retargeting non-parameterized 3D garments onto 3D human avatars of arbitrary shapes and poses, enabling 3D virtual try-on (VTON). Existing self-supervised 3D retargeting methods only support parametric and canonical garments, which can only be draped over parametric body, e.g. SMPL. To facilitate the non-parametric garments and body, we propose a no…

    Submitted 5 January, 2024; originally announced January 2024.

  15. arXiv:2312.14198  [pdf, other]

    cs.CV

    ZeroShape: Regression-based Zero-shot Shape Reconstruction

    Authors: Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg

    Abstract: We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such reg…

    Submitted 16 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Project page: https://zixuanh.com/projects/zeroshape.html

  16. arXiv:2312.09168  [pdf, other]

    cs.CV cs.GR cs.LG

    DiffusionLight: Light Probes for Free by Painting a Chrome Ball

    Authors: Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet, Amit Raj, Varun Jampani, Pramook Khungurn, Supasorn Suwajanakorn

    Abstract: We present a simple yet effective technique to estimate lighting in a single input image. Current techniques rely heavily on HDR panorama datasets to train neural networks to regress an input with limited field-of-view to a full environment map. However, these approaches often struggle with real-world, uncontrolled settings due to the limited diversity and size of their datasets. To address this p…

    Submitted 9 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 Oral. For more information and code, please visit our website https://diffusionlight.github.io/

    ACM Class: I.3.3; I.4.8

  17. arXiv:2312.06553  [pdf, other]

    cs.CV

    HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models

    Authors: Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang

    Abstract: We address the problem of generating realistic 3D human-object interactions (HOIs) driven by textual prompts. To this end, we take a modular design and decompose the complex task into simpler sub-tasks. We first develop a dual-branch diffusion model (HOI-DM) to generate both human and object motions conditioned on the input text, and encourage coherent motions by a cross-attention communication mo…

    Submitted 15 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Project Page: https://neu-vi.github.io/HOI-Diff/

  18. arXiv:2312.04560  [pdf, other]

    cs.CV cs.AI cs.GR

    NeRFiller: Completing Scenes via Generative 3D Inpainting

    Authors: Ethan Weber, Aleksander Hołyński, Varun Jampani, Saurabh Saxena, Noah Snavely, Abhishek Kar, Angjoo Kanazawa

    Abstract: We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g., contact regions, such as the bottom of objects, or hard-to-reach areas). We approach this challenging 3D inpaintin…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project page: https://ethanweber.me/nerfiller

  19. arXiv:2312.02970  [pdf, other]

    cs.CV cs.AI cs.GR

    Alchemist: Parametric Control of Material Properties with Diffusion Models

    Authors: Prafull Sharma, Varun Jampani, Yuanzhen Li, Xuhui Jia, Dmitry Lagun, Fredo Durand, William T. Freeman, Mark Matthews

    Abstract: We propose a method to control material attributes of objects like roughness, metallic, albedo, and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism, employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes, we generated an object-ce…

    Submitted 5 December, 2023; originally announced December 2023.

  20. arXiv:2312.01985  [pdf, other]

    cs.CV

    UniGS: Unified Representation for Image Generation and Segmentation

    Authors: Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang

    Abstract: This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are pro…

    Submitted 4 December, 2023; originally announced December 2023.

  21. arXiv:2311.17776  [pdf, other]

    cs.CV

    One-Shot Open Affordance Learning with Foundation Models

    Authors: Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani

    Abstract: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analy…

    Submitted 29 November, 2023; originally announced November 2023.

  22. arXiv:2311.17034  [pdf, other]

    cs.CV

    Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

    Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

    Abstract: While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporatin…

    Submitted 24 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted by CVPR 24, project page: https://telling-left-from-right.github.io/

  23. arXiv:2311.16052  [pdf, other]

    cs.CV

    Exploring Attribute Variations in Style-based GANs using Diffusion Models

    Authors: Rishubh Parihar, Prasanna Balaji, Raghav Magazine, Sarthak Vora, Tejan Karmali, Varun Jampani, R. Venkatesh Babu

    Abstract: Existing attribute editing methods treat semantic attributes as binary, resulting in a single edit per attribute. However, attributes such as eyeglasses, smiles, or hairstyles exhibit a vast range of diversity. In this work, we formulate the task of diverse attribute editing by modeling the multidimensional nature of attribute edits. This enables users to generate multiple plausible edits…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Neurips Workshop on Diffusion Models 2023

  24. arXiv:2311.15127  [pdf, other]

    cs.CV

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

    Abstract: We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary wi…

    Submitted 25 November, 2023; originally announced November 2023.

  25. arXiv:2311.13600  [pdf, other]

    cs.CV cs.GR cs.LG

    ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs

    Authors: Viraj Shah, Nataniel Ruiz, Forrester Cole, Erika Lu, Svetlana Lazebnik, Yuanzhen Li, Varun Jampani

    Abstract: Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and su…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Project page: https://ziplora.github.io

  26. arXiv:2310.08580  [pdf, other]

    cs.CV cs.GR

    OmniControl: Control Any Joint at Any Time for Human Motion Generation

    Authors: Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, Huaizu Jiang

    Abstract: We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose…

    Submitted 14 April, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. Project page: https://neu-vi.github.io/omnicontrol/

  27. arXiv:2307.06949  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

    Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, Kfir Aberman

    Abstract: Personalization has emerged as a prominent aspect within the field of generative AI, enabling the synthesis of individuals in diverse contexts and styles, while retaining high-fidelity to their identities. However, the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment, and sto…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: project page: https://hyperdreambooth.github.io

  28. arXiv:2306.09109  [pdf, other]

    cs.CV

    NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations

    Authors: Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, André Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou

    Abstract: Recent advances in neural reconstruction enable high-quality 3D object reconstruction from casually captured image collections. Current techniques mostly analyze their progress on relatively simple image collections where Structure-from-Motion (SfM) techniques can provide ground-truth (GT) camera poses. We note that SfM techniques tend to fail on in-the-wild image collections such as image search…

    Submitted 13 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023 camera ready. Project page: https://navidataset.github.io

  29. arXiv:2306.05428  [pdf, other]

    cs.CV

    Background Prompting for Improved Object Depth

    Authors: Manel Baradad, Yuanzhen Li, Forrester Cole, Michael Rubinstein, Antonio Torralba, William T. Freeman, Varun Jampani

    Abstract: Estimating the depth of objects from a single image is a valuable task for many vision, robotics, and graphics applications. However, current methods often fail to produce accurate depth for objects in diverse scenes. In this work, we propose a simple yet effective Background Prompting strategy that adapts the input object image with a learned background. We learn the background prompts only using…

    Submitted 8 June, 2023; originally announced June 2023.

  30. arXiv:2306.05410  [pdf, other]

    cs.CV

    LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

    Authors: Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia

    Abstract: A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, which offers an alternative to off-the-shelf SfM pipelines which have well-understood failure modes. Existing approaches for unposed NeRF operate und…

    Submitted 8 June, 2023; originally announced June 2023.

    Comments: Project website: https://people.cs.umass.edu/~zezhoucheng/lu-nerf/

  31. arXiv:2306.04619  [pdf, other]

    cs.CV

    ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections

    Authors: Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani

    Abstract: Estimating 3D articulated shapes like animal bodies from monocular images is inherently challenging due to the ambiguities of camera viewpoint, pose, texture, lighting, etc. We propose ARTIC3D, a self-supervised framework to reconstruct per-instance 3D shapes from a sparse image collection in-the-wild. Specifically, ARTIC3D is built upon a skeleton-based surface representation and is further guide…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Project page: https://chhankyao.github.io/artic3d/

  32. arXiv:2305.18373  [pdf, other]

    cs.CV cs.CL

    KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

    Authors: Zhiwei Jia, Pradyumna Narayana, Arjun R. Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani

    Abstract: Image ad understanding is a crucial task with wide real-world applications. Although highly challenging with the involvement of diverse atypical scenes, real-world entities, and reasoning over scene-texts, how to interpret image ads is relatively under-explored, especially in the era of foundational vision-language models (VLMs) featuring impressive generalizability and adaptability. In this paper…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  33. arXiv:2305.15393  [pdf, other]

    cs.CV cs.AI

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

    Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual genera…

    Submitted 28 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  34. arXiv:2305.15347  [pdf, other]

    cs.CV

    A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

    Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang

    Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across mul…

    Submitted 28 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted by NeurIPS 23, project page: https://sd-complements-dino.github.io/

  35. arXiv:2305.10722  [pdf, other]

    cs.CV

    Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

    Authors: Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

    Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To an…

    Submitted 24 April, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  36. arXiv:2305.01618  [pdf, other]

    cs.CV cs.LG cs.RO

    ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation

    Authors: Zehao Zhu, Jiashun Wang, Yuzhe Qin, Deqing Sun, Varun Jampani, Xiaolong Wang

    Abstract: We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation. We first collect a dataset using visual teleoperation, where the human operator can directly play within a physical simulator to manipulate the articulated objects. We record the data and obtain free and accurate annotations on object poses and contact informati…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: Project: https://zehaozhu.github.io/ContactArt/ ; Dataset Explorer: https://zehaozhu.github.io/ContactArt/explorer/

  37. arXiv:2304.06247  [pdf, other]

    cs.CV

    ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency

    Authors: Zixuan Huang, Varun Jampani, Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg

    Abstract: We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP en…

    Submitted 12 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPR 2023, project website at https://zixuanh.com/projects/shapeclipper.html

  38. arXiv:2304.05866  [pdf, other]

    cs.CV cs.LG

    NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs

    Authors: Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, R. Venkatesh Babu

    Abstract: StyleGANs are at the forefront of controllable image generation as they produce a latent space that is semantically disentangled, making it suitable for image editing and manipulation. However, the performance of StyleGANs severely degrades when trained via class-conditioning on large-scale long-tailed datasets. We find that one reason for degradation is the collapse of latents for each class in t…

    Submitted 12 April, 2023; originally announced April 2023.

    Comments: CVPR 2023. Project Page: https://rangwani-harsh.github.io/NoisyTwins/

  39. arXiv:2303.16201  [pdf, other]

    cs.CV cs.AI cs.LG

    ASIC: Aligning Sparse in-the-wild Image Collections

    Authors: Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar

    Abstract: We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions hold true for the long-tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a…

    Submitted 28 March, 2023; originally announced March 2023.

    Comments: Web: https://kampta.github.io/asic

  40. arXiv:2303.13508  [pdf, other]

    cs.CV cs.AI cs.GR

    DreamBooth3D: Subject-Driven Text-to-3D Generation

    Authors: Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, Yuanzhen Li, Varun Jampani

    Abstract: We present DreamBooth3D, an approach to personalize text-to-3D generative models from as few as 3-6 casually captured images of a subject. Our approach combines recent advances in personalizing text-to-image models (DreamBooth) with text-to-3D generation (DreamFusion). We find that naively combining these methods fails to yield satisfactory subject-specific 3D assets due to personalized text-to-im…

    Submitted 27 March, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: Project page at https://dreambooth3d.github.io/ Video Summary at https://youtu.be/kKVDrbfvOoA

  41. arXiv:2303.09665  [pdf, other]

    cs.CV

    LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

    Authors: Gen Li, Varun Jampani, Deqing Sun, Laura Sevilla-Lara

    Abstract: Humans excel at acquiring knowledge through observation. For example, we can learn to use new tools by watching demonstrations. This skill is fundamental for intelligent systems to interact with the world. A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding. In this paper, we address this problem and propose a framework…

    Submitted 16 March, 2023; originally announced March 2023.

    Comments: CVPR 2023, Project page: https://reagan1311.github.io/locate/, Video: https://www.youtube.com/watch?v=RLHansdFxII

  42. arXiv:2302.04862  [pdf, other]

    cs.CV cs.LG

    Polynomial Neural Fields for Subband Decomposition and Manipulation

    Authors: Guandao Yang, Sagie Benaim, Varun Jampani, Kyle Genova, Jonathan T. Barron, Thomas Funkhouser, Bharath Hariharan, Serge Belongie

    Abstract: Neural fields have emerged as a new paradigm for representing signals, thanks to their ability to do it compactly while being easy to optimize. In most applications, however, neural fields are treated like black boxes, which precludes many signal manipulation tasks. In this paper, we propose a new class of neural fields called polynomial neural fields (PNFs). The key advantage of a PNF is that it…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: Accepted to NeurIPS 2022

  43. arXiv:2302.00070  [pdf, other]

    cs.LG cs.CV

    Debiasing Vision-Language Models via Biased Prompts

    Authors: Ching-Yao Chuang, Varun Jampani, Yuanzhen Li, Antonio Torralba, Stefanie Jegelka

    Abstract: Machine learning models have been shown to inherit biases from their training datasets. This can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach f…

    Submitted 15 May, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

  44. arXiv:2212.11042  [pdf, other]

    cs.CV

    Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble

    Authors: Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani

    Abstract: Automatically estimating 3D skeleton, shape, camera viewpoints, and part articulation from sparse in-the-wild image ensembles is a severely under-constrained and challenging problem. Most prior methods rely on large-scale image datasets, dense temporal correspondence, or human annotations like camera pose, 2D keypoints, and shape templates. We propose Hi-LASSIE, which performs 3D articulated recon…

    Submitted 25 March, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Project page: https://chhankyao.github.io/hi-lassie/

  45. arXiv:2212.09898  [pdf, other]

    cs.CV

    MetaCLUE: Towards Comprehensive Visual Metaphors Research

    Authors: Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani

    Abstract: Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, met…

    Submitted 2 June, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted in CVPR 2023. Project page: https://metaclue.github.io/ , Video summary: https://youtu.be/V3TmeNETL-o

  46. arXiv:2212.05032  [pdf, other]

    cs.CV cs.CL

    Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

    Authors: Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

    Abstract: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. In this work, we improve the compositional skills of T2I models, s…

    Submitted 28 February, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: ICLR 2023 Camera Ready version

  47. arXiv:2210.15909  [pdf, other]

    cs.CV cs.LG

    Subsidiary Prototype Alignment for Universal Domain Adaptation

    Authors: Jogendra Nath Kundu, Suvaansh Bhambri, Akshay Kulkarni, Hiran Sarkar, Varun Jampani, R. Venkatesh Babu

    Abstract: Universal Domain Adaptation (UniDA) deals with the problem of knowledge transfer between two datasets with domain-shift as well as category-shift. The goal is to categorize unlabeled target samples, either into one of the "known" categories or into a single "unknown" category. A major problem in UniDA is negative transfer, i.e. misalignment of "known" and "unknown" classes. To this end, we first u…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: NeurIPS 2022. Project page: https://sites.google.com/view/spa-unida

  48. arXiv:2210.10362  [pdf, other]

    cs.CV cs.AI cs.CL

    CPL: Counterfactual Prompt Learning for Vision and Language Models

    Authors: Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

    Abstract: Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a no…

    Submitted 4 November, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

  49. arXiv:2208.12242  [pdf, other]

    cs.CV cs.GR cs.LG

    DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

    Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman

    Abstract: Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image…

    Submitted 15 March, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

    Comments: Published at CVPR 2023. Project page: https://dreambooth.github.io/

  50. arXiv:2208.09932  [pdf, other]

    cs.CV cs.LG eess.IV

    Improving GANs for Long-Tailed Data through Group Spectral Regularization

    Authors: Harsh Rangwani, Naman Jaswani, Tejan Karmali, Varun Jampani, R. Venkatesh Babu

    Abstract: Deep long-tailed learning aims to train useful deep networks on practical, real-world imbalanced distributions, wherein most labels of the tail classes are associated with a few samples. There has been a large body of work to train discriminative models for visual recognition on long-tailed distribution. In contrast, we aim to train conditional Generative Adversarial Networks, a class of image gen…

    Submitted 21 August, 2022; originally announced August 2022.

    Comments: ECCV 2022. Project Page: https://sites.google.com/view/gsr-eccv22