CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Authors: Yigit Ekin, Ahmet Burak Yildirim, Erdem Eren Caglar, Aykut Erdem, Erkut Erdem, Aysegul Dundar

Abstract: Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Project page: https://yigitekin.github.io/CLIPAway/

arXiv:2404.03632 [pdf, other]

Reference-Based 3D-Aware Image Editing with Triplane

Authors: Bahri Batuhan Bilecen, Yigit Yalin, Ning Yu, Aysegul Dundar

Abstract: Generative Adversarial Networks (GANs) have emerged as powerful tools not only for high-quality image generation but also for real image editing through manipulation of their interpretable latent spaces. Recent advancements in GANs include the development of 3D-aware models such as EG3D, characterized by efficient triplane-based architectures enabling the reconstruction of 3D geometry from single images. However, scant attention has been devoted to providing an integrated framework for high-quality reference-based 3D-aware image editing within this domain. This study addresses this gap by exploring and demonstrating the effectiveness of EG3D's triplane space for achieving advanced reference-based edits, presenting a unique perspective on 3D-aware image editing through our novel pipeline. Our approach integrates the encoding of triplane features, spatial disentanglement and automatic localization of features in the triplane domain, and fusion learning for desired image editing. Moreover, our framework demonstrates versatility across domains, extending its effectiveness to animal face edits and partial stylization of cartoon portraits. The method shows significant improvements over relevant 3D-aware latent editing and 2D reference-based editing methods, both qualitatively and quantitatively. Project page: https://three-bee.github.io/triplane_edit

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2312.11422 [pdf, other]

Warping the Residuals for Image Editing with StyleGAN

Authors: Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar

Abstract: StyleGAN models show editing capabilities via their semantically interpretable latent organizations which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN's latent space. However, their results either suffer from low fidelity to the input image or poor editing qualities, especially for edits that require large transformations. That is because low-rate latent spaces lose many image details due to the information bottleneck even though they provide an editable space. On the other hand, higher-rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing qualities. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. The flows are estimated from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.

Submitted 18 December, 2023; originally announced December 2023.

arXiv:2309.13975 [pdf, other]

Diverse Semantic Image Editing with Style Codes

Authors: Hakan Sivuk, Aysegul Dundar

Abstract: Semantic image editing requires inpainting pixels following a semantic map. It is a challenging task since this inpainting requires both harmony with the context and strict compliance with the semantic maps. The majority of the previous methods proposed for this task try to encode the whole information from erased images. However, when an object is added to a scene such as a car, its style cannot be encoded from the context alone. On the other hand, the models that can output diverse generations struggle to output images that have seamless boundaries between the generated and unerased parts. Additionally, previous methods do not have a mechanism to encode the styles of visible and partially visible objects differently for better performance. In this work, we propose a framework that can encode visible and partially visible objects with a novel mechanism to achieve consistency in the style encoding and final generations. We extensively compare with previous conditional image generation and semantic image editing algorithms. Our extensive experiments show that our method significantly improves over the state-of-the-art. Our method not only achieves better quantitative results but also provides diverse results. Please refer to the project web page for the released code and demo: https://github.com/hakansivuk/DivSem.

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2307.15033 [pdf, other]

Diverse Inpainting and Editing with GAN Inversion

Authors: Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar

Abstract: Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements.

Submitted 27 July, 2023; originally announced July 2023.

Comments: ICCV 2023

arXiv:2305.11102 [pdf, other]

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

Abstract: This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method relies on GAN-generated multi-view image datasets, which have a negligible annotation cost. However, they are not strictly multi-view consistent and sometimes GANs output distorted images. This results in degraded reconstruction qualities. In this work, to overcome these limitations of generated datasets, we have two main contributions which lead us to achieve state-of-the-art results on challenging objects: 1) A robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, 2) A novel adversarial learning pipeline with online pseudo-ground truth generations to achieve fine details. Our work provides a bridge from 2D supervisions of GAN models to 3D reconstruction models and removes the expensive annotation efforts. We show significant improvements over previous methods whether they were trained on GAN-generated multi-view images or on real images with expensive annotations. Please visit our web-page for 3D visuals: https://research.nvidia.com/labs/adlr/progressive-3d-learning

Submitted 18 May, 2023; originally announced May 2023.

Comments: Web-page: https://research.nvidia.com/labs/adlr/progressive-3d-learning. arXiv admin note: text overlap with arXiv:2203.09362

arXiv:2304.03246 [pdf, other]

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Authors: Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar

Abstract: The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove, which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to remove based on natural language input and removes it simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.

Submitted 9 August, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

arXiv:2303.03471 [pdf, other]

Refining 3D Human Texture Estimation from a Single Image

Authors: Said Fahri Altindis, Adil Meric, Yusuf Dalva, Ugur Gudukbay, Aysegul Dundar

Abstract: Estimating 3D human texture from a single image is essential in graphics and vision. It requires learning a mapping function from input images of humans with diverse poses into the parametric (UV) space and reasonably hallucinating invisible parts. To achieve a high-quality 3D human texture estimation, we propose a framework that adaptively samples the input by a deformable convolution where offsets are learned via a deep neural network. Additionally, we describe a novel cycle consistency loss that improves view generalization. We further propose to train our framework with an uncertainty-based pixel-level image reconstruction loss, which enhances color fidelity. We compare our method against the state-of-the-art approaches and show significant qualitative and quantitative improvements.

Submitted 6 March, 2023; originally announced March 2023.

arXiv:2301.04628 [pdf, other]

Face Attribute Editing with Disentangled Latent Vectors

Authors: Yusuf Dalva, Hamza Pehlivan, Cansu Moran, Öykü Irmak Hatipoğlu, Ayşegül Dündar

Abstract: We propose an image-to-image translation framework for facial attribute editing with disentangled interpretable latent directions. The facial attribute editing task faces the challenges of targeted attribute editing with controllable strength and disentanglement in the representations of attributes to preserve the other attributes during edits. For this goal, inspired by the latent space factorization works of fixed pretrained GANs, we design the attribute editing by latent space factorization, and for each attribute, we learn a linear direction that is orthogonal to the others. We train these directions with orthogonality constraints and disentanglement losses. To project images to semantically organized latent spaces, we set an encoder-decoder architecture with attention-based skip connections. We extensively compare with previous image translation algorithms and editing with pretrained GAN works. Our extensive experiments show that our method significantly improves over the state of the art. Project page: https://yusufdalva.github.io/vecgan

Submitted 11 January, 2023; originally announced January 2023.

Comments: See https://yusufdalva.github.io/vecgan for the project webpage. arXiv admin note: substantial text overlap with arXiv:2207.03411

arXiv:2212.14359 [pdf, other]

StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN

Authors: Hamza Pehlivan, Yusuf Dalva, Aysegul Dundar

Abstract: We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressive power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes

Submitted 29 December, 2022; originally announced December 2022.

arXiv:2207.03411 [pdf, other]

VecGAN: Image-to-Image Translation with Interpretable Latent Directions

Authors: Yusuf Dalva, Said Fahri Altindis, Aysegul Dundar

Abstract: We propose VecGAN, an image-to-image translation framework for facial attribute editing with interpretable latent directions. The facial attribute editing task faces the challenges of precise attribute editing with controllable strength and preservation of the other attributes of an image. For this goal, we design the attribute editing by latent space factorization and for each attribute, we learn a linear direction that is orthogonal to the others. The other component is the controllable strength of the change, a scalar value. In our framework, this scalar can be either sampled or encoded from a reference image by projection. Our work is inspired by the latent space factorization works of fixed pretrained GANs. However, while those models cannot be trained end-to-end and struggle to edit encoded images precisely, VecGAN is trained end-to-end for the image translation task and is successful at editing an attribute while preserving the others. Our extensive experiments show that VecGAN achieves significant improvements over the state of the art for both local and global edits.

Submitted 7 July, 2022; originally announced July 2022.

Comments: ECCV 2022

arXiv:2203.09362 [pdf, other]

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

Abstract: This paper presents a method to reconstruct high-quality textured 3D models from both multi-view and single-view images. The reconstruction is posed as an adaptation problem and is done progressively where in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network. In the generative learning pipeline, we propose two improvements. First, since the learned textures should be spatially aligned, we propose an attention mechanism that relies on the learnable positions of pixels. Secondly, since the discriminator receives aligned texture maps, we augment its input with a learnable embedding which improves the feedback to the generator. We achieve significant improvements on multi-view sequences from the Tripod dataset as well as on single-view image datasets, Pascal 3D+ and CUB. We demonstrate that our method achieves superior 3D textured models compared to the previous works. Please visit our web-page for 3D visuals.

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2109.01123 [pdf, other]

Benchmarking the Robustness of Instance Segmentation Models

Authors: Said Fahri Altindis, Yusuf Dalva, Hamza Pehlivan, Aysegul Dundar

Abstract: This paper presents a comprehensive evaluation of instance segmentation models with respect to real-world image corruptions as well as out-of-domain image collections, e.g. images captured by a different set-up than the training dataset. The out-of-domain image evaluation shows the generalization capability of models, an essential aspect of real-world applications and an extensively studied topic of domain adaptation. These presented robustness and generalization evaluations are important when designing instance segmentation models for real-world applications and picking an off-the-shelf pretrained model to directly use for the task at hand. Specifically, this benchmark study includes state-of-the-art network architectures, network backbones, normalization layers, models trained starting from scratch versus pretrained networks, and the effect of multi-task training on robustness and generalization. Through this study, we gain several insights. For example, we find that group normalization enhances the robustness of networks across corruptions where the image contents stay the same but corruptions are added on top. On the other hand, batch normalization improves the generalization of the models across different datasets where statistics of image features change. We also find that single-stage detectors do not generalize well to larger image resolutions than their training size. On the other hand, multi-stage detectors can easily be used on images of different sizes. We hope that our comprehensive study will motivate the development of more robust and reliable instance segmentation models.

Submitted 10 August, 2022; v1 submitted 2 September, 2021; originally announced September 2021.

arXiv:2106.06533 [pdf, other]

View Generalization for Single Image Textured 3D Models

Authors: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro

Abstract: Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.

Submitted 10 June, 2021; originally announced June 2021.

Comments: CVPR 2021. Project website: https://nv-adlr.github.io/view-generalization

arXiv:2103.16748 [pdf, other]

Dual Contrastive Loss and Attention for GANs

Authors: Ning Yu, Guilin Liu, Aysegul Dundar, Andrew Tao, Bryan Catanzaro, Larry Davis, Mario Fritz

Abstract: Generative Adversarial Networks (GANs) produce impressive results on unconditional image generation when powered with large-scale image datasets. Yet generated images are still easy to spot especially on datasets with high variance (e.g. bedroom, church). In this paper, we propose various improvements to further push the boundaries in image generation. Specifically, we propose a novel dual contrastive loss and show that, with this loss, discriminator learns more generalized and distinguishable representations to incentivize generation. In addition, we revisit attention and extensively experiment with different attention blocks in the generator. We find attention to be still an important module for successful image generation even though it was not used in the recent state-of-the-art models. Lastly, we study different attention architectures in the discriminator, and propose a reference attention mechanism. By combining the strengths of these remedies, we improve the compelling state-of-the-art Fréchet Inception Distance (FID) by at least 17.5% on several benchmark datasets. We obtain even more significant improvements on compositional synthetic scenes (up to 47.5% in FID). Code and models are available at https://github.com/ningyu1991/AttentionDualContrastGAN .

Submitted 17 March, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: Accepted to ICCV'21

arXiv:2004.10289 [pdf, other]

Panoptic-based Image Synthesis

Authors: Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro

Abstract: Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation. Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. We propose a panoptic-aware image synthesis network to generate high-fidelity and photorealistic images conditioned on panoptic maps which unify semantic and instance information. To achieve this, we efficiently use panoptic maps in convolution and upsampling layers. We show that with the proposed changes to the generator, we can improve on the previous state-of-the-art methods by generating images in complex instance interaction environments in higher fidelity and tiny objects in more detail. Furthermore, our proposed method also outperforms the previous state-of-the-art methods in metrics of mean IoU (Intersection over Union), and detAP (Detection Average Precision).

Submitted 21 April, 2020; originally announced April 2020.

Comments: CVPR 2020

arXiv:2001.09518 [pdf, other]

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Authors: Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Abstract: Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task.

Submitted 26 January, 2020; originally announced January 2020.

arXiv:1909.02749 [pdf, other]

Video Interpolation and Prediction with Unsupervised Landmarks

Authors: Kevin J. Shih, Aysegul Dundar, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Abstract: Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting. Optical flow based techniques generalize but are suitable only for short temporal ranges. Many methods opt to project the video frames to a low dimensional latent space, achieving long-range predictions. However, these latent representations are often non-interpretable, and therefore difficult to manipulate. This work poses video prediction and interpolation as unsupervised latent structure inference followed by a temporal prediction in this latent space. The latent representations capture foreground semantics without explicit supervision such as keypoints or poses. Further, as each landmark can be mapped to a coordinate indicating where a semantic part is positioned, we can reliably interpolate within the coordinate domain to achieve predictable motion interpolation. Given an image decoder capable of mapping these landmarks back to the image domain, we are able to achieve high-quality long-range video interpolation and extrapolation by operating on the landmark representation space.

Submitted 6 September, 2019; originally announced September 2019.

Comments: Technical Report

arXiv:1906.05928 [pdf, other]

Unsupervised Video Interpolation Using Cycle Consistency

Authors: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro

Abstract: Learning to synthesize high frame rate videos via interpolation requires large quantities of high frame rate training videos, which, however, are scarce, especially at high resolutions. Here, we propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. For a triplet of consecutive frames, we optimize models to minimize the discrepancy between the center frame and its cycle reconstruction, obtained by interpolating back from interpolated intermediate frames. This simple unsupervised constraint alone achieves results comparable with supervision using the ground truth intermediate frames. We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. With no additional data and in a completely unsupervised fashion, our techniques significantly improve pre-trained models on new target domains, increasing PSNR values from 32.84dB to 33.05dB on the Slowflow and from 31.82dB to 32.53dB on the Sintel evaluation datasets.

Submitted 27 March, 2021; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: Published in ICCV 2019. Codes are available at https://github.com/NVIDIA/unsupervised-video-interpolation. Project website https://nv-adlr.github.io/publication/2019-UnsupervisedVideoInterpolation

arXiv:1807.09384 [pdf, other]

Domain Stylization: A Strong, Simple Baseline for Synthetic to Real Image Domain Adaptation

Authors: Aysegul Dundar, Ming-Yu Liu, Ting-Chun Wang, John Zedlewski, Jan Kautz

Abstract: Deep neural networks have largely failed to effectively utilize synthetic data when applied to real images due to the covariate shift problem. In this paper, we show that by applying a straightforward modification to an existing photorealistic style transfer algorithm, we achieve state-of-the-art synthetic-to-real domain adaptation results. We conduct extensive experimental validations on four synthetic-to-real tasks for semantic segmentation and object detection, and show that our approach exceeds the performance of any current state-of-the-art GAN-based image translation approach as measured by segmentation and object detection metrics. Furthermore we offer a distance based analysis of our method which shows a dramatic reduction in Frechet Inception distance between the source and target domains, offering a quantitative metric that demonstrates the effectiveness of our algorithm in bridging the synthetic-to-real gap.

Submitted 24 July, 2018; originally announced July 2018.

arXiv:1712.01653 [pdf, other]

Context Augmentation for Convolutional Neural Networks

Authors: Aysegul Dundar, Ignacio Garcia-Dorado

Abstract: Recent enhancements of deep convolutional neural networks (ConvNets) empowered by enormous amounts of labeled data have closed the gap with human performance for many object recognition tasks. These impressive results have generated interest in understanding and visualization of ConvNets. In this work, we study the effect of background in the task of image classification. Our results show that changing the backgrounds of the training datasets can have drastic effects on testing accuracies. Furthermore, we enhance existing augmentation techniques with the foreground segmented objects. The findings of this work are important in increasing the accuracies when only a small dataset is available, in creating datasets, and creating synthetic images.

Submitted 11 December, 2017; v1 submitted 22 November, 2017; originally announced December 2017.

Comments: 8 pages, 7 figures

arXiv:1706.05048 [pdf, other]

Human-like Clustering with Deep Convolutional Neural Networks

Authors: Ali Borji, Aysegul Dundar

Abstract: Classification and clustering have been studied separately in machine learning and computer vision. Inspired by the recent success of deep learning models in solving various vision problems (e.g., object recognition, semantic segmentation) and the fact that humans serve as the gold standard in assessing clustering algorithms, here, we advocate for a unified treatment of the two problems and suggest that hierarchical frameworks that progressively build complex patterns on top of the simpler ones (e.g., convolutional neural networks) offer a promising solution. We do not dwell much on the learning mechanisms in these frameworks as they are still a matter of debate, with respect to biological constraints. Instead, we emphasize the compositionality of the real world structures and objects. In particular, we show that CNNs, trained end to end using back propagation with noisy labels, are able to cluster data points belonging to several overlapping shapes, and do so much better than the state of the art algorithms. The main takeaway lesson from our study is that mechanisms of human vision, particularly the hierarchical organization of the visual ventral stream should be taken into account in clustering algorithms (e.g., for learning representations in an unsupervised manner or with minimum supervision) to reach human level clustering performance. This, by no means, suggests that other methods do not hold merits. For example, methods relying on pairwise affinities (e.g., spectral clustering) have been very successful in many scenarios but still fail in some cases (e.g., overlapping clusters).

Submitted 11 December, 2017; v1 submitted 15 June, 2017; originally announced June 2017.

arXiv:1511.06306 [pdf, other]

Robust Convolutional Neural Networks under Adversarial Noise

Authors: Jonghoon Jin, Aysegul Dundar, Eugenio Culurciello

Abstract: Recent studies have shown that Convolutional Neural Networks (CNNs) are vulnerable to a small perturbation of input called "adversarial examples". In this work, we propose a new feedforward CNN that improves robustness in the presence of adversarial noise. Our model uses stochastic additive noise added to the input image and to the CNN models. The proposed model operates in conjunction with a CNN trained with either standard or adversarial objective function. In particular, convolution, max-pooling, and ReLU layers are modified to benefit from the noise model. Our feedforward model is parameterized by only a mean and variance per pixel which simplifies computations and makes our method scalable to a deep architecture. From CIFAR-10 and ImageNet test, the proposed model outperforms other methods and the improvement is more evident for difficult classification tasks or stronger adversarial noise.

Submitted 25 February, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

Comments: 8 pages

arXiv:1511.06241 [pdf, other]

Convolutional Clustering for Unsupervised Learning

Authors: Aysegul Dundar, Jonghoon Jin, Eugenio Culurciello

Abstract: The task of labeling data for training deep neural networks is daunting and tedious, requiring millions of labels to achieve the current state-of-the-art results. Such reliance on large amounts of labeled data can be relaxed by exploiting hierarchical features via unsupervised learning techniques. In this work, we propose to train a deep convolutional network based on an enhanced version of the k-means clustering algorithm, which reduces the number of correlated parameters in the form of similar filters, and thus increases test categorization accuracy. We call our algorithm convolutional k-means clustering. We further show that learning the connection between the layers of a deep convolutional neural network improves its ability to be trained on a smaller amount of labeled data. Our experiments show that the proposed algorithm outperforms other techniques that learn filters unsupervised. Specifically, we obtained a test accuracy of 74.1% on STL-10 and a test error of 0.5% on MNIST.

Submitted 16 February, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

Comments: 11 pages

arXiv:1412.5474 [pdf, other]

Flattened Convolutional Neural Networks for Feedforward Acceleration

Authors: Jonghoon Jin, Aysegul Dundar, Eugenio Culurciello

Abstract: We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially weights of the convolutional filters in convolutional neural networks, has been extensively studied and different heuristics have been proposed to construct a low rank basis of the filters after training. In this work, we train flattened networks that consist of a consecutive sequence of one-dimensional filters across all directions in 3D space to obtain performance comparable to conventional convolutional networks. We tested the flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post processing once the model is trained.

Submitted 20 November, 2015; v1 submitted 17 December, 2014; originally announced December 2014.

Comments: International Conference on Learning Representations (ICLR) 2015

arXiv:1306.0152 [pdf, other]

An Analysis of the Connections Between Layers of Deep Neural Networks

Authors: Eugenio Culurciello, Jonghoon Jin, Aysegul Dundar, Jordan Bates

Abstract: We present an analysis of different techniques for selecting the connection between layers of deep neural networks. Traditional deep neural networks use random connection tables between layers to keep the number of connections small and tune to different image features. This kind of connection performs adequately in supervised deep networks because their values are refined during the training. On the other hand, in unsupervised learning, one cannot rely on back-propagation techniques to learn the connections between layers. In this work, we tested four different techniques for connecting the first layer of the network to the second layer on the CIFAR and SVHN datasets and showed that the accuracy can be improved up to 3% depending on the technique used. We also showed that learning the connections based on the co-occurrences of the features does not confer an advantage over a random connection table in small networks. This work is helpful to improve the efficiency of connections between the layers of unsupervised deep neural networks.

Submitted 1 June, 2013; originally announced June 2013.

arXiv:1301.2820 [pdf, other]

Clustering Learning for Robotic Vision

Authors: Eugenio Culurciello, Jordan Bates, Aysegul Dundar, Jose Carrasco, Clement Farabet

Abstract: We present the clustering learning technique applied to multi-layer feedforward deep neural networks. We show that this unsupervised learning technique can compute network filters with only a few minutes and a much reduced set of parameters. The goal of this paper is to promote the technique for general-purpose robotic vision systems. We report its use in static image datasets and object tracking datasets. We show that networks trained with clustering learning can outperform large networks trained for many hours on complex datasets.

Submitted 13 March, 2013; v1 submitted 13 January, 2013; originally announced January 2013.

Comments: Code for this paper is available here: https://github.com/culurciello/CL_paper1_code

arXiv:1209.2696 [pdf, ps, other]

Visual Tracking with Similarity Matching Ratio

Authors: Aysegul Dundar, Jonghoon Jin, Eugenio Culurciello

Abstract: This paper presents a novel approach to visual tracking: Similarity Matching Ratio (SMR). The traditional approach of tracking is minimizing some measures of the difference between the template and a patch from the frame. This approach is vulnerable to outliers and drastic appearance changes and an extensive study is focusing on making the approach more tolerant to them. However, this often results in longer, corrective algorithms which do not solve the original problem. This paper proposes a novel approach to the definition of the tracking problems, SMR, which turns the differences into a probability measure. Only pixel differences below a threshold count towards deciding the match, the rest are ignored. This approach makes the SMR tracker robust to outliers and points that dramatically change appearance. The SMR tracker is tested on challenging video sequences and achieved state-of-the-art performance.

Submitted 12 September, 2012; originally announced September 2012.

arXiv:1104.1112 [pdf, ps, other]

doi 10.1063/1.3593963

Electromechanical wavelength tuning of double-membrane photonic crystal cavities

Authors: L. Midolo, P. J. van Veldhoven, M. A. Dundar, R. Nötzel, A. Fiore

Abstract: We present a method for tuning the resonant wavelength of photonic crystal cavities (PCCs) around 1.55 um. Large tuning of the PCC mode is enabled by electromechanically controlling the separation between two parallel InGaAsP membranes. A fabrication method to avoid sticking between the membranes is discussed. Reversible red/blue shifting of the symmetric/anti-symmetric modes has been observed, which provides clear evidence of the electromechanical tuning, and a maximum shift of 10 nm with < 6 V applied bias has been obtained.

Submitted 6 April, 2011; originally announced April 2011.

Comments: 9 pages, 3 figures

Journal ref: Appl. Phys. Lett. 98, 211120 (2011)

arXiv:0705.2637 [pdf]

doi 10.1364/JOSAB.24.001824

A method for volume stabilization of single, dye-doped water microdroplets with femtoliter resolution

Authors: A. Kiraz, A. Kurt, M. A. Dündar, M. Y. Yüce, A. L. Demirel

Abstract: A self-control mechanism that stabilizes the size of Rhodamine B-doped water microdroplets standing on a superhydrophobic surface is demonstrated. The mechanism relies on the interplay between the condensation rate that was kept constant and evaporation rate induced by laser excitation which critically depends on the size of the microdroplets. The radii of individual water microdroplets (>5 um) stayed within a few nanometers during long time periods (up to 455 seconds). By blocking the laser excitation for 500 msec, the stable volume of individual microdroplets was shown to change stepwise.

Submitted 18 May, 2007; originally announced May 2007.

Comments: to appear in the J. Op. Soc. Am. B

arXiv:0705.2482 [pdf]

doi 10.1016/j.optcom.2007.04.026

Lasing from single, stationary, dye-doped glycerol/water microdroplets located on a superhydrophobic surface

Authors: A. Kiraz, A. Sennaroglu, S. Doğanay, M. A. Dündar, A. Kurt, H. Kalaycıoğlu, A. L. Demirel

Abstract: We report laser emission from single, stationary, Rhodamine B-doped glycerol/water microdroplets located on a superhydrophobic surface. In the experiments, a pulsed, frequency-doubled Nd:YAG laser operating at 532 nm was used as the excitation source. The microdroplets ranged in diameter from a few to 20 um. Lasing was achieved in the red-shifted portion of the dye emission spectrum with threshold fluences as low as 750 J/cm2. Photobleaching was observed when the microdroplets were pumped above threshold. In certain cases, multimode lasing was also observed and attributed to the simultaneous lasing of two modes belonging to different sets of whispering gallery modes.

Submitted 17 May, 2007; originally announced May 2007.

Comments: to appear in Optics Communications

Showing 1–31 of 31 results for author: Dundar, A