Search | arXiv e-print repository

SweepNet: Unsupervised Learning Shape Abstraction via Neural Sweepers

Authors: Mingrui Zhao, Yizhi Wang, Fenggen Yu, Changqing Zou, Ali Mahdavi-Amiri

Abstract: Shape abstraction is an important task for simplifying complex geometric structures while retaining essential features. Sweep surfaces, commonly found in human-made objects, aid in this process by effectively capturing and representing object geometry, thereby facilitating abstraction. In this paper, we introduce \papername, a novel approach to shape abstraction through sweep surfaces. We propose… ▽ More Shape abstraction is an important task for simplifying complex geometric structures while retaining essential features. Sweep surfaces, commonly found in human-made objects, aid in this process by effectively capturing and representing object geometry, thereby facilitating abstraction. In this paper, we introduce \papername, a novel approach to shape abstraction through sweep surfaces. We propose an effective parameterization for sweep surfaces, utilizing superellipses for profile representation and B-spline curves for the axis. This compact representation, requiring as few as 14 float numbers, facilitates intuitive and interactive editing while preserving shape details effectively. Additionally, by introducing a differentiable neural sweeper and an encoder-decoder architecture, we demonstrate the ability to predict sweep surface representations without supervision. We show the superiority of our model through several quantitative and qualitative experiments throughout the paper. Our code is available at https://mingrui-zhao.github.io/SweepNet/ △ Less

Submitted 8 July, 2024; originally announced July 2024.

Comments: 14 pages,20 figures, ECCV 2024

arXiv:2406.01300 [pdf, other]

pOps: Photo-Inspired Diffusion Operators

Authors: Elad Richardson, Yuval Alaluf, Ali Mahdavi-Amiri, Daniel Cohen-Or

Abstract: Text-guided image generation enables the creation of visual content from textual descriptions. However, certain visual concepts cannot be effectively conveyed through language alone. This has sparked a renewed interest in utilizing the CLIP image embedding space for more visually-oriented tasks through methods such as IP-Adapter. Interestingly, the CLIP image embedding space has been shown to be s… ▽ More Text-guided image generation enables the creation of visual content from textual descriptions. However, certain visual concepts cannot be effectively conveyed through language alone. This has sparked a renewed interest in utilizing the CLIP image embedding space for more visually-oriented tasks through methods such as IP-Adapter. Interestingly, the CLIP image embedding space has been shown to be semantically meaningful, where linear operations within this space yield semantically meaningful results. Yet, the specific meaning of these operations can vary unpredictably across different images. To harness this potential, we introduce pOps, a framework that trains specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior model. While the Diffusion Prior model was originally trained to map between text embeddings and image embeddings, we demonstrate that it can be tuned to accommodate new input conditions, resulting in a diffusion operator. Working directly over image embeddings not only improves our ability to learn semantic operations but also allows us to directly use a textual CLIP loss as an additional supervision when needed. We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings, highlighting the semantic diversity and potential of our proposed approach. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: Project Page: https://popspaper.github.io/pOps/

arXiv:2405.17393 [pdf, other]

EASI-Tex: Edge-Aware Mesh Texturing from Single Image

Authors: Sai Raj Kishore Perla, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We present a novel approach for single-image mesh texturing, which employs a diffusion model with judicious conditioning to seamlessly transfer an object's texture from a single RGB image to a given 3D mesh object. We do not assume that the two objects belong to the same category, and even if they do, there can be significant discrepancies in their geometry and part proportions. Our method aims to… ▽ More We present a novel approach for single-image mesh texturing, which employs a diffusion model with judicious conditioning to seamlessly transfer an object's texture from a single RGB image to a given 3D mesh object. We do not assume that the two objects belong to the same category, and even if they do, there can be significant discrepancies in their geometry and part proportions. Our method aims to rectify the discrepancies by conditioning a pre-trained Stable Diffusion generator with edges describing the mesh through ControlNet, and features extracted from the input image using IP-Adapter to generate textures that respect the underlying geometry of the mesh and the input texture without any optimization or training. We also introduce Image Inversion, a novel technique to quickly personalize the diffusion model for a single concept using a single image, for cases where the pre-trained IP-Adapter falls short in capturing all the details from the input image faithfully. Experimental results demonstrate the efficiency and effectiveness of our edge-aware single-image mesh texturing approach, coined EASI-Tex, in preserving the details of the input texture on diverse 3D objects, while respecting their geometry. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: ACM Transactions on Graphics (Proceedings of SIGGRAPH), 2024. Project Page: https://sairajk.github.io/easi-tex/

arXiv:2403.14937 [pdf, other]

Survey on Modeling of Articulated Objects

Authors: Jiayi Liu, Manolis Savva, Ali Mahdavi-Amiri

Abstract: 3D modeling of articulated objects is a research problem within computer vision, graphics, and robotics. Its objective is to understand the shape and motion of the articulated components, represent the geometry and mobility of object parts, and create realistic models that reflect articulated objects in the real world. This survey provides a comprehensive overview of the current state-of-the-art i… ▽ More 3D modeling of articulated objects is a research problem within computer vision, graphics, and robotics. Its objective is to understand the shape and motion of the articulated components, represent the geometry and mobility of object parts, and create realistic models that reflect articulated objects in the real world. This survey provides a comprehensive overview of the current state-of-the-art in 3D modeling of articulated objects, with a specific focus on the task of articulated part perception and articulated object creation (reconstruction and generation). We systematically review and discuss the relevant literature from two perspectives: geometry processing and articulation modeling. Through this survey, we highlight the substantial progress made in these areas, outline the ongoing challenges, and identify gaps for future research. Our survey aims to serve as a foundational reference for researchers and practitioners in computer vision and graphics, offering insights into the complexities of articulated object modeling. △ Less

Submitted 21 March, 2024; originally announced March 2024.

arXiv:2403.04943 [pdf, other]

AFreeCA: Annotation-Free Counting for All

Authors: Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Abstract: Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted the versatility of these networks to count objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these m… ▽ More Object counting methods typically rely on manually annotated datasets. The cost of creating such datasets has restricted the versatility of these networks to count objects from specific classes (such as humans or penguins), and counting objects from diverse categories remains a challenge. The availability of robust text-to-image latent diffusion models (LDMs) raises the question of whether these models can be utilized to generate counting datasets. However, LDMs struggle to create images with an exact number of objects based solely on text prompts but they can be used to offer a dependable \textit{sorting} signal by adding and removing objects within an image. Leveraging this data, we initially introduce an unsupervised sorting methodology to learn object-related features that are subsequently refined and anchored for counting purposes using counting data generated by LDMs. Further, we present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted. Consequently, we can generate counting data for any type of object and count them in an unsupervised manner. Our approach outperforms other unsupervised and few-shot alternatives and is not restricted to specific object classes for which counting data is available. Code to be released upon acceptance. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2402.03549 [pdf]

AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising

Authors: Maham Tanveer, Yizhi Wang, Ruiqi Wang, Nanxuan Zhao, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We present AnaMoDiff, a novel diffusion-based method for 2D motion analogies that is applied to raw, unannotated videos of articulated characters. Our goal is to accurately transfer motions from a 2D driving video onto a source character, with its identity, in terms of appearance and natural movement, well preserved, even when there may be significant discrepancies between the source and driving c… ▽ More We present AnaMoDiff, a novel diffusion-based method for 2D motion analogies that is applied to raw, unannotated videos of articulated characters. Our goal is to accurately transfer motions from a 2D driving video onto a source character, with its identity, in terms of appearance and natural movement, well preserved, even when there may be significant discrepancies between the source and driving characters in their part proportions and movement speed and styles. Our diffusion model transfers the input motion via a latent optical flow (LOF) network operating in a noised latent space, which is spatially aware, efficient to process compared to the original RGB videos, and artifact-resistant through the diffusion denoising process even amid dense movements. To accomplish both motion analogy and identity preservation, we train our denoising model in a feature-disentangled manner, operating at two noise levels. While identity-revealing features of the source are learned via conventional noise injection, motion features are learned from LOF-warped videos by only injecting noise with large values, with the stipulation that motion properties involving pose and limbs are encoded by higher-level features. Experiments demonstrate that our method achieves the best trade-off between motion analogy and identity preservation. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2312.09570 [pdf, other]

CAGE: Controllable Articulation GEneration

Authors: Jiayi Liu, Hou In Ivan Tam, Ali Mahdavi-Amiri, Manolis Savva

Abstract: We address the challenge of generating 3D articulated objects in a controllable fashion. Currently, modeling articulated 3D objects is either achieved through laborious manual authoring, or using methods from prior work that are hard to scale and control directly. We leverage the interplay between part shape, connectivity, and motion using a denoising diffusion-based method with attention modules… ▽ More We address the challenge of generating 3D articulated objects in a controllable fashion. Currently, modeling articulated 3D objects is either achieved through laborious manual authoring, or using methods from prior work that are hard to scale and control directly. We leverage the interplay between part shape, connectivity, and motion using a denoising diffusion-based method with attention modules designed to extract correlations between part attributes. Our method takes an object category label and a part connectivity graph as input and generates an object's geometry and motion parameters. The generated objects conform to user-specified constraints on the object category, part shape, and part articulation. Our experiments show that our method outperforms the state-of-the-art in articulated object generation, producing more realistic objects while conforming better to user constraints. Video Summary at: http://youtu.be/cH_rbKbyTpE △ Less

Submitted 20 March, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: CVPR 2024. Project page: https://3dlg-hcvc.github.io/cage/

arXiv:2312.02221 [pdf, other]

Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction

Authors: Yizhi Wang, Wallace Lira, Wenqi Wang, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel t… ▽ More We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is more advantageous than altering views to reveal occluded structures. Specifically, slicing is more occlusion-revealing since it can peel through any occluders without obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed to unveil all hidden object parts. We realize our idea by develo** Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB image and then integrates the slices into a 3D model using a coordinate-based transformer network for signed distance prediction. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate superiority of our method, especially in recovering complex and severely occluded shape structures, amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time less than 20 seconds. △ Less

Submitted 10 December, 2023; v1 submitted 3 December, 2023; originally announced December 2023.

Comments: Project website: https://yizhiwang96.github.io/Slice3D/

arXiv:2311.17083 [pdf, other]

CLiC: Concept Learning in Context

Authors: Mehdi Safaee, Aryan Mikaeili, Or Patashnik, Daniel Cohen-Or, Ali Mahdavi-Amiri

Abstract: This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves a… ▽ More This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g., an ornament) from a source image and subsequently applying it to an object (e.g., a chair) in a target image. Our key idea is to perform in-context concept learning, acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image, showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images, employing cross-attention mechanisms, and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments, along with comparisons against baseline techniques. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2310.01662 [pdf, other]

SYRAC: Synthesize, Rank, and Count

Authors: Adriano D'Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Abstract: Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, necessitating the manual localization of each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly or semi-supervised learning, these approaches fall short of significan… ▽ More Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, necessitating the manual localization of each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly or semi-supervised learning, these approaches fall short of significantly reducing the workload. We propose a novel approach to eliminate the annotation burden by leveraging latent diffusion models to generate synthetic data. However, these models struggle to reliably understand object quantities, leading to noisy annotations when prompted to produce images with a specific quantity of objects. To address this, we use latent diffusion models to create two types of synthetic data: one by removing pedestrians from real images, which generates ranked image pairs with a weak but reliable object quantity signal, and the other by generating synthetic images with a predetermined number of objects, offering a strong but noisy counting signal. Our method utilizes the ranking image pairs for pre-training and then fits a linear layer to the noisy synthetic images using these crowd quantity features. We report state-of-the-art results for unsupervised crowd counting. △ Less

Submitted 11 October, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2309.00733 [pdf, other]

TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models

Authors: Saeid Asgari Taghanaki, Aliasghar Khani, Ali Saheb Pasand, Amir Khasahmadi, Aditya Sanghi, Karl D. D. Willis, Ali Mahdavi-Amiri

Abstract: Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between th… ▽ More Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and language models. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers. △ Less

Submitted 1 May, 2024; v1 submitted 1 September, 2023; originally announced September 2023.

Comments: Accepted to ICLR 2024, Reliable and Responsible Foundation Models workshop

arXiv:2308.07391 [pdf, other]

PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects

Authors: Jiayi Liu, Ali Mahdavi-Amiri, Manolis Savva

Abstract: We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects. Given two sets of multi-view images of an object in two static articulation states, we decouple the movable part from the static part and reconstruct shape and appearance while predicting the motion parameters. To tackle this problem, we present PARIS: a self-supervised, end-to-en… ▽ More We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects. Given two sets of multi-view images of an object in two static articulation states, we decouple the movable part from the static part and reconstruct shape and appearance while predicting the motion parameters. To tackle this problem, we present PARIS: a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly without any 3D supervision, motion, or semantic annotation. Our experiments show that our method generalizes better across object categories, and outperforms baselines and prior work that are given 3D point clouds as input. Our approach improves reconstruction relative to state-of-the-art baselines with a Chamfer-L1 distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, and achieves 5% error rate for motion estimation across 10 object categories. Video summary at: https://youtu.be/tDSrROPCgUc △ Less

Submitted 14 August, 2023; originally announced August 2023.

Comments: Presented at ICCV 2023. Project website: https://3dlg-hcvc.github.io/paris/

arXiv:2305.18601 [pdf, other]

BRICS: Bi-level feature Representation of Image CollectionS

Authors: Dingdong Yang, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We present BRICS, a bi-level feature representation for image collections, which consists of a key code space on top of a feature grid space. Specifically, our representation is learned by an autoencoder to encode images into continuous key codes, which are used to retrieve features from groups of multi-resolution feature grids. Our key codes and feature grids are jointly trained continuously with… ▽ More We present BRICS, a bi-level feature representation for image collections, which consists of a key code space on top of a feature grid space. Specifically, our representation is learned by an autoencoder to encode images into continuous key codes, which are used to retrieve features from groups of multi-resolution feature grids. Our key codes and feature grids are jointly trained continuously with well-defined gradient flows, leading to high usage rates of the feature grids and improved generative modeling compared to discrete Vector Quantization (VQ). Differently from existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRICS is compact, efficient to train, and enables generative modeling over key codes using the diffusion model. Experimental results show that our method achieves comparable reconstruction results to VQ while having a smaller and more efficient decoder network (50% fewer GFlops). By applying the diffusion model over our key code space, we achieve state-of-the-art performance on image synthesis on the FFHQ and LSUN-Church (29% lower than LDM, 32% lower than StyleGAN2, 44% lower than Projected GAN on CLIP-FID) datasets. △ Less

Submitted 30 December, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2303.10735 [pdf, other]

SKED: Sketch-guided Text-based 3D Editing

Authors: Aryan Mikaeili, Or Perel, Mehdi Safaee, Daniel Cohen-Or, Ali Mahdavi-Amiri

Abstract: Text-to-image diffusion models are gradually introduced into computer graphics, recently enabling the development of Text-to-3D pipelines in an open domain. However, for interactive editing purposes, local manipulations of content through a simplistic textual interface can be arduous. Incorporating user guided sketches with Text-to-image pipelines offers users more intuitive control. Still, as sta… ▽ More Text-to-image diffusion models are gradually introduced into computer graphics, recently enabling the development of Text-to-3D pipelines in an open domain. However, for interactive editing purposes, local manipulations of content through a simplistic textual interface can be arduous. Incorporating user guided sketches with Text-to-image pipelines offers users more intuitive control. Still, as state-of-the-art Text-to-3D pipelines rely on optimizing Neural Radiance Fields (NeRF) through gradients from arbitrary rendering views, conditioning on sketches is not straightforward. In this paper, we present SKED, a technique for editing 3D shapes represented by NeRFs. Our technique utilizes as few as two guiding sketches from different views to alter an existing neural field. The edited region respects the prompt semantics through a pre-trained diffusion model. To ensure the generated output adheres to the provided sketches, we propose novel loss functions to generate the desired edits while preserving the density and radiance of the base instance. We demonstrate the effectiveness of our proposed method through several qualitative and quantitative experiments. https://sked-paper.github.io/ △ Less

Submitted 18 August, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

arXiv:2303.09604 [pdf, other]

DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion

Authors: Maham Tanveer, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce a novel method to automatically generate an artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable. To address an assortment of challenges with our task at hand including conflicting goals (artistic stylization vs. legibility), lack of ground truth, and immense search space, our appro… ▽ More We introduce a novel method to automatically generate an artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable. To address an assortment of challenges with our task at hand including conflicting goals (artistic stylization vs. legibility), lack of ground truth, and immense search space, our approach utilizes large language models to bridge texts and visual images for stylization and build an unsupervised generative model with a diffusion model backbone. Specifically, we employ the denoising generator in Latent Diffusion Model (LDM), with the key addition of a CNN-based discriminator to adapt the input style onto the input text. The discriminator uses rasterized images of a given letter/word font as real samples and output of the denoising generator as fake samples. Our model is coined DS-Fusion for discriminated and stylized diffusion. We showcase the quality and versatility of our method through numerous examples, qualitative and quantitative evaluation, as well as ablation studies. User studies comparing to strong baselines including CLIPDraw and DALL-E 2, as well as artist-crafted typographies, demonstrate strong performance of DS-Fusion. △ Less

Submitted 16 March, 2023; originally announced March 2023.

Comments: Project website: https://ds-fusion.github.io/

arXiv:2302.10167 [pdf, other]

Cross-domain Compositing with Pretrained Diffusion Models

Authors: Roy Hachnochi, Mingrui Zhao, Nadav Orzech, Rinon Gal, Ali Mahdavi-Amiri, Daniel Cohen-Or, Amit Haim Bermano

Abstract: Diffusion models have enabled high-quality, conditional image editing capabilities. We propose to expand their arsenal, and demonstrate that off-the-shelf diffusion models can be used for a wide range of cross-domain compositing tasks. Among numerous others, these include image blending, object immersion, texture-replacement and even CG2Real translation or stylization. We employ a localized, itera… ▽ More Diffusion models have enabled high-quality, conditional image editing capabilities. We propose to expand their arsenal, and demonstrate that off-the-shelf diffusion models can be used for a wide range of cross-domain compositing tasks. Among numerous others, these include image blending, object immersion, texture-replacement and even CG2Real translation or stylization. We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene, and enables control over the degree and types of changes the object may undergo. We conduct a range of qualitative and quantitative comparisons to prior work, and exhibit that our method produces higher quality and realistic results without requiring any annotations or training. Finally, we demonstrate how our method may be used for data augmentation of downstream tasks. △ Less

Submitted 25 May, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: Code: https://github.com/cross-domain-compositing/cross-domain-compositing

arXiv:2210.00055 [pdf, other]

MaskTune: Mitigating Spurious Correlations by Forcing to Explore

Authors: Saeid Asgari Taghanaki, Aliasghar Khani, Fereshte Khani, Ali Gholami, Linh Tran, Ali Mahdavi-Amiri, Ghassan Hamarneh

Abstract: A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that yield good performance on a downstream task without over-fitting spurious input features. This work proposes MaskTune, a masking strategy that prevents over-reliance on spurious (or a limited number of) features. MaskTune forces the trained model to explore new features during a sing… ▽ More A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that yield good performance on a downstream task without over-fitting spurious input features. This work proposes MaskTune, a masking strategy that prevents over-reliance on spurious (or a limited number of) features. MaskTune forces the trained model to explore new features during a single epoch finetuning by masking previously discovered features. MaskTune, unlike earlier approaches for mitigating shortcut learning, does not require any supervision, such as annotating spurious features or labels for subgroup samples in a dataset. Our empirical results on biased MNIST, CelebA, Waterbirds, and ImagenNet-9L datasets show that MaskTune is effective on tasks that often suffer from the existence of spurious correlations. Finally, we show that MaskTune outperforms or achieves similar performance to the competing methods when applied to the selective classification (classification with rejection option) task. Code for MaskTune is available at https://github.com/aliasgharkhani/Masktune. △ Less

Submitted 8 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2208.02205 [pdf, other]

Large-scale Building Damage Assessment using a Novel Hierarchical Transformer Architecture on Satellite Images

Authors: Navjot Kaur, Cheng-Chun Lee, Ali Mostafavi, Ali Mahdavi-Amiri

Abstract: This paper presents \dahitra, a novel deep-learning model with hierarchical transformers to classify building damages based on satellite images in the aftermath of natural disasters. Satellite imagery provides real-time and high-coverage information and offers opportunities to inform large-scale post-disaster building damage assessment, which is critical for rapid emergency response. In this work,… ▽ More This paper presents \dahitra, a novel deep-learning model with hierarchical transformers to classify building damages based on satellite images in the aftermath of natural disasters. Satellite imagery provides real-time and high-coverage information and offers opportunities to inform large-scale post-disaster building damage assessment, which is critical for rapid emergency response. In this work, a novel transformer-based network is proposed for assessing building damage. This network leverages hierarchical spatial features of multiple resolutions and captures the temporal differences in the feature domain after applying a transformer encoder on the spatial features. The proposed network achieves state-of-the-art performance when tested on a large-scale disaster damage dataset (xBD) for building localization and damage classification, as well as on LEVIR-CD dataset for change detection tasks. In addition, this work introduces a new high-resolution satellite imagery dataset, Ida-BD (related to 2021 Hurricane Ida in Louisiana in 2021) for domain adaptation. Further, it demonstrates an approach of using this dataset by adapting the model with limited fine-tuning and hence applying the model to newly damaged areas with scarce data. △ Less

Submitted 4 February, 2023; v1 submitted 3 August, 2022; originally announced August 2022.

arXiv:2112.06596 [pdf, other]

SAC-GAN: Structure-Aware Image Composition

Authors: Hang Zhou, Rui Ma, Ling-Xiao Zhang, Lin Gao, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce an end-to-end learning framework for image-to-image composition, aiming to plausibly compose an object represented as a cropped patch from an object image into a background scene image. As our approach emphasizes more on semantic and structural coherence of the composed images, rather than their pixel-level RGB accuracies, we tailor the input and output of our network with structure-a… ▽ More We introduce an end-to-end learning framework for image-to-image composition, aiming to plausibly compose an object represented as a cropped patch from an object image into a background scene image. As our approach emphasizes more on semantic and structural coherence of the composed images, rather than their pixel-level RGB accuracies, we tailor the input and output of our network with structure-aware features and design our network losses accordingly, with ground truth established in a self-supervised setting through the object crop**. Specifically, our network takes the semantic layout features from the input scene image, features encoded from the edges and silhouette in the input object patch, as well as a latent code as inputs, and generates a 2D spatial affine transform defining the translation and scaling of the object patch. The learned parameters are further fed into a differentiable spatial transformer network to transform the object patch into the target image, where our model is trained adversarially using an affine transform discriminator and a layout discriminator. We evaluate our network, coined SAC-GAN, for various image composition scenarios in terms of quality, composability, and generalizability of the composite images. Comparisons are made to state-of-the-art alternatives, including Instance Insertion, ST-GAN, CompGAN and PlaceNet, confirming superiority of our method. △ Less

Submitted 2 December, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Accepted to TVCG. Code: https://github.com/RyanHangZhou/SAC-GAN

arXiv:2112.05381 [pdf, other]

UNIST: Unpaired Neural Implicit Shape Translation Network

Authors: Qimin Chen, Johannes Merz, Aditya Sanghi, Hooman Shayani, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce UNIST, the first deep neural implicit model for general-purpose, unpaired shape-to-shape translation, in both 2D and 3D domains. Our model is built on autoencoding implicit fields, rather than point clouds which represents the state of the art. Furthermore, our translation network is trained to perform the task over a latent grid representation which combines the merits of both latent… ▽ More We introduce UNIST, the first deep neural implicit model for general-purpose, unpaired shape-to-shape translation, in both 2D and 3D domains. Our model is built on autoencoding implicit fields, rather than point clouds which represents the state of the art. Furthermore, our translation network is trained to perform the task over a latent grid representation which combines the merits of both latent-space processing and position awareness, to not only enable drastic shape transforms but also well preserve spatial features and fine local details for natural shape translations. With the same network architecture and only dictated by the input domain pairs, our model can learn both style-preserving content alteration and content-preserving style transfer. We demonstrate the generality and quality of the translation results, and compare them to well-known baselines. Code is available at https://qiminchen.github.io/unist/. △ Less

Submitted 30 March, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: CVPR 2022. project page: https://qiminchen.github.io/unist/

arXiv:2106.16237 [pdf, other]

Multimodal Shape Completion via IMLE

Authors: Himanshu Arora, Saurabh Mishra, Shichong Peng, Ke Li, Ali Mahdavi-Amiri

Abstract: Shape completion is the problem of completing partial input shapes such as partial scans. This problem finds important applications in computer vision and robotics due to issues such as occlusion or sparsity in real-world data. However, most of the existing research related to shape completion has been focused on completing shapes by learning a one-to-one map** which limits the diversity and cre… ▽ More Shape completion is the problem of completing partial input shapes such as partial scans. This problem finds important applications in computer vision and robotics due to issues such as occlusion or sparsity in real-world data. However, most of the existing research related to shape completion has been focused on completing shapes by learning a one-to-one map** which limits the diversity and creativity of the produced results. We propose a novel multimodal shape completion technique that is effectively able to learn a one-to-many map** and generates diverse complete shapes. Our approach is based on the conditional Implicit MaximumLikelihood Estimation (IMLE) technique wherein we condition our inputs on partial 3D point clouds. We extensively evaluate our approach by comparing it to various baselines both quantitatively and qualitatively. We show that our method is superior to alternatives in terms of completeness and diversity of shapes. △ Less

Submitted 7 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: Project Website: https://sites.google.com/site/alimahdaviamiri/projects/shape-completion

arXiv:2104.05652 [pdf, other]

CAPRI-Net: Learning Compact CAD Shapes with Adaptive Primitive Assembly

Authors: Fenggen Yu, Zhiqin Chen, Manyi Li, Aditya Sanghi, Hooman Shayani, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce CAPRI-Net, a neural network for learning compact and interpretable implicit representations of 3D computer-aided design (CAD) models, in the form of adaptive primitive assemblies. Our network takes an input 3D shape that can be provided as a point cloud or voxel grids, and reconstructs it by a compact assembly of quadric surface primitives via constructive solid geometry (CSG) operati… ▽ More We introduce CAPRI-Net, a neural network for learning compact and interpretable implicit representations of 3D computer-aided design (CAD) models, in the form of adaptive primitive assemblies. Our network takes an input 3D shape that can be provided as a point cloud or voxel grids, and reconstructs it by a compact assembly of quadric surface primitives via constructive solid geometry (CSG) operations. The network is self-supervised with a reconstruction loss, leading to faithful 3D reconstructions with sharp edges and plausible CSG trees, without any ground-truth shape assemblies. While the parametric nature of CAD models does make them more predictable locally, at the shape level, there is a great deal of structural and topological variations, which present a significant generalizability challenge to state-of-the-art neural models for 3D shapes. Our network addresses this challenge by adaptive training with respect to each test shape, with which we fine-tune the network that was pre-trained on a model collection. We evaluate our learning framework on both ShapeNet and ABC, the largest and most diverse CAD dataset to date, in terms of reconstruction quality, shape edges, compactness, and interpretability, to demonstrate superiority over current alternatives suitable for neural CAD reconstruction. △ Less

Submitted 12 April, 2021; originally announced April 2021.

arXiv:2104.04606 [pdf, other]

RaidaR: A Rich Annotated Image Dataset of Rainy Street Scenes

Authors: Jiongchao **, Arezou Fatemi, Wallace Lira, Fenggen Yu, Biao Leng, Rui Ma, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce RaidaR, a rich annotated image dataset of rainy street scenes, to support autonomous driving research. The new dataset contains the largest number of rainy images (58,542) to date, 5,000 of which provide semantic segmentations and 3,658 provide object instance segmentations. The RaidaR images cover a wide range of realistic rain-induced artifacts, including fog, droplets, and road ref… ▽ More We introduce RaidaR, a rich annotated image dataset of rainy street scenes, to support autonomous driving research. The new dataset contains the largest number of rainy images (58,542) to date, 5,000 of which provide semantic segmentations and 3,658 provide object instance segmentations. The RaidaR images cover a wide range of realistic rain-induced artifacts, including fog, droplets, and road reflections, which can effectively augment existing street scene datasets to improve data-driven machine perception during rainy weather. To facilitate efficient annotation of a large volume of images, we develop a semi-automatic scheme combining manual segmentation and an automated processing akin to cross validation, resulting in 10-20 fold reduction on annotation time. We demonstrate the utility of our new dataset by showing how data augmentation with RaidaR can elevate the accuracy of existing segmentation algorithms. We also present a novel unpaired image-to-image translation algorithm for adding/removing rain artifacts, which directly benefits from RaidaR. △ Less

Submitted 26 October, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: Presented in Second ICCV Workshop on Autonomous Vehicle Vision (AVVision), 2021. Website: https://raidar-dataset.com/

arXiv:2102.11175 [pdf, other]

doi 10.1111/cgf.14330

Data to Physicalization: A Survey of the Physical Rendering Process

Authors: Hessam Djavaherpour, Faramarz Samavati, Ali Mahdavi-Amiri, Fatemeh Yazdanbakhsh, Samuel Huron, Richard Levy, Yvonne Jansen, Lora Oehlberg

Abstract: Physical representations of data offer physical and spatial ways of looking at, navigating, and interacting with data. While digital fabrication has facilitated the creation of objects with data-driven geometry, rendering data as a physically fabricated object is still a daunting leap for many physicalization designers. Rendering in the scope of this research refers to the back-and-forth process f… ▽ More Physical representations of data offer physical and spatial ways of looking at, navigating, and interacting with data. While digital fabrication has facilitated the creation of objects with data-driven geometry, rendering data as a physically fabricated object is still a daunting leap for many physicalization designers. Rendering in the scope of this research refers to the back-and-forth process from digital design to digital fabrication and its specific challenges. We developed a corpus of example data physicalizations from research literature and physicalization practice. This survey then unpacks the "rendering" phase of the extended InfoVis pipeline in greater detail through these examples, with the aim of identifying ways that researchers, artists, and industry practitioners "render" physicalizations using digital design and fabrication tools. △ Less

Submitted 22 February, 2021; originally announced February 2021.

ACM Class: I.3.5; I.3.8

Journal ref: Computer Graphics Forum, Wiley, 2021, 40 (3), pp.569-598

arXiv:2007.04883 [pdf, other]

PIE-NET: Parametric Inference of Point Cloud Edges

Authors: Xiaogang Wang, Yuelang Xu, Kai Xu, Andrea Tagliasacchi, Bin Zhou, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data. We represent these edges as a collection of parametric curves (i.e.,lines, circles, and B-splines). Accordingly, our deep neural network, coined PIE-NET, is trained for parametric inference of edges. The network relies on a "region proposal" architecture, where a first module proposes an over-… ▽ More We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data. We represent these edges as a collection of parametric curves (i.e.,lines, circles, and B-splines). Accordingly, our deep neural network, coined PIE-NET, is trained for parametric inference of edges. The network relies on a "region proposal" architecture, where a first module proposes an over-complete collection of edge and corner points, and a second module ranks each proposal to decide whether it should be considered. We train and evaluate our method on the ABC dataset, a large dataset of CAD models, and compare our results to those produced by traditional (non-learning) processing pipelines, as well as a recent deep learning based edge detector (EC-NET). Our results significantly improve over the state-of-the-art from both a quantitative and qualitative standpoint. △ Less

Submitted 25 October, 2020; v1 submitted 9 July, 2020; originally announced July 2020.

arXiv:1804.06579 [pdf, other]

doi 10.1145/3182158

Semi-Supervised Co-Analysis of 3D Shape Styles from Projected Lines

Authors: Fenggen Yu, Yan Zhang, Kai Xu, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision. Given a collection of 3D shapes spanning multiple object categories and styles, we perform style co-analysis over projected feature lines of each 3D shape and then backproject the learned style features onto the 3D shapes. Our cor… ▽ More We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision. Given a collection of 3D shapes spanning multiple object categories and styles, we perform style co-analysis over projected feature lines of each 3D shape and then backproject the learned style features onto the 3D shapes. Our core analysis pipeline starts with mid-level patch sampling and pre-selection of candidate style patches. Projective features are then encoded via patch convolution. Multi-view feature integration and style clustering are carried out under the framework of partially shared latent factor (PSLF) learning, a multi-view feature learning scheme. PSLF achieves effective multi-view feature fusion by distilling and exploiting consistent and complementary feature information from multiple views, while also selecting style patches from the candidates. Our style analysis approach supports both unsupervised and semi-supervised analysis. For the latter, our method accepts both user-specified shape labels and style-ranked triplets as clustering constraints.We demonstrate results from 3D shape style analysis and patch localization as well as improvements over state-of-the-art methods. We also present several applications enabled by our style analysis. △ Less

Submitted 15 January, 2022; v1 submitted 18 April, 2018; originally announced April 2018.

Comments: 17 pages, 25 figures

MSC Class: 68U05 ACM Class: H.4.2

Journal ref: ACM Transactions on Graphics 37(2). February 2018

Showing 1–26 of 26 results for author: Mahdavi-Amiri, A