Search | arXiv e-print repository

arXiv:2406.12095

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus

Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, thereby helping our model to learn 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10324

L4GM: Large 4D Gaussian Reconstruction Model

Authors: Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling

Abstract: We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames. We keep our L4GM simple for scalability and build directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input. L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness. We add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model. The representation is upsampled to a higher framerate by training an interpolation model which produces intermediate 3D Gaussian representations. We show that L4GM, although trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high quality animated 3D assets.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Project page: https://research.nvidia.com/labs/toronto-ai/l4gm

arXiv:2406.06650

Predicting the risk of early-stage breast cancer recurrence using H&E-stained tissue images

Authors: Geongyu Lee, Joonho Lee, Tae-Yeong Kwak, Sun Woo Kim, Youngmee Kwon, Chungyeul Kim, Hyeyoon Chang

Abstract: Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images labeled with the risk prediction via genomics assays were used, and we obtained sensitivity of 0.857, 0.746, and 0.529 for predicting low, intermediate, and high risk, and specificity of 0.816, 0.803, and 0.972. When compared to the expert pathologist's regional histology grade information, a Pearson's correlation coefficient of 0.61 was obtained. When we inspected the trained model using class activation maps, we found that it considered tubule formation and mitotic rate when predicting different risk groups.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 12 pages, 7 figures

arXiv:2405.14126

The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks

Authors: Bum Jun Kim, Yoshinobu Kawahara, Sang Woo Kim

Abstract: Dynamical systems are often time-varying, whose modeling requires a function that evolves with respect to time. Recent studies such as the neural ordinary differential equation proposed a time-dependent neural network, which provides a neural network varying with respect to time. However, we claim that the architectural choice to build a time-dependent neural network significantly affects its time-awareness but still lacks sufficient validation in its current states. In this study, we conduct an in-depth analysis of the architecture of modern time-dependent neural networks. Here, we report a vulnerability of vanishing timestep embedding, which disables the time-awareness of a time-dependent neural network. Furthermore, we find that this vulnerability can also be observed in diffusion models because they employ a similar architecture that incorporates timestep embedding to discriminate between different timesteps during a diffusion process. Our analysis provides a detailed description of this phenomenon as well as several solutions to address the root cause. Through experiments on neural ordinary differential equations and diffusion models, we observed that ensuring alive time-awareness via proposed solutions boosted their performance, which implies that their current implementations lack sufficient time-dependency.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 14 pages, 7 figures

arXiv:2405.14115

Configuring Data Augmentations to Reduce Variance Shift in Positional Embedding of Vision Transformers

Authors: Bum Jun Kim, Sang Woo Kim

Abstract: Vision transformers (ViTs) have demonstrated remarkable performance in a variety of vision tasks. Despite their promising capabilities, training a ViT requires a large amount of diverse data. Several studies empirically found that using rich data augmentations, such as Mixup, Cutmix, and random erasing, is critical to the successful training of ViTs. Now, the use of rich data augmentations has become a standard practice in the current state. However, we report a vulnerability to this practice: Certain data augmentations such as Mixup cause a variance shift in the positional embedding of ViT, which has been a hidden factor that degrades the performance of ViT during the test phase. We claim that achieving a stable effect from positional embedding requires a specific condition on the image, which is often broken for the current data augmentation methods. We provide a detailed analysis of this problem as well as the correct configuration for these data augmentations to remove the side effects of variance shift. Experiments showed that adopting our guidelines improves the performance of ViTs compared with the current configuration of data augmentations.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 16 pages, 4 figures
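
A rough way to see why Mixup could change input statistics, as the entry above (arXiv:2405.14115) reports, is to compute the variance of a convex combination of two images. The sketch below is not the paper's analysis: it assumes two independent inputs with equal variance and a fixed mixing coefficient `lam`, and the helper name `mixup_variance_factor` is hypothetical.

```python
def mixup_variance_factor(lam: float) -> float:
    """Variance of lam*x1 + (1-lam)*x2 relative to Var(x1) = Var(x2).

    Assumes x1 and x2 are independent with equal variance. This is an
    illustrative assumption, not a claim taken from the paper.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Var(a*x1 + b*x2) = a^2 Var(x1) + b^2 Var(x2) for independent x1, x2.
    return lam ** 2 + (1.0 - lam) ** 2


if __name__ == "__main__":
    for lam in (0.0, 0.25, 0.5, 1.0):
        factor = mixup_variance_factor(lam)
        print(f"lam={lam:.2f} -> variance factor {factor:.4f}")
```

Under these assumptions an evenly mixed pair (`lam = 0.5`) has half the variance of an unmixed image, so a model trained on mixed inputs may see different input statistics than it does at test time.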

arXiv:2404.10765

RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting

Authors: Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic

Abstract: Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Project page: https://reffusion.github.io

arXiv:2402.01149

Scale Equalization for Multi-Level Feature Fusion

Authors: Bum Jun Kim, Sang Woo Kim

Abstract: Deep neural networks have exhibited remarkable performance in a variety of computer vision fields, especially in semantic segmentation tasks. Their success is often attributed to multi-level feature fusion, which enables them to understand both global and local information from an image. However, we found that multi-level features from parallel branches are on different scales. The scale disequilibrium is a universal and unwanted flaw that leads to detrimental gradient descent, thereby degrading performance in semantic segmentation. We discover that scale disequilibrium is caused by bilinear upsampling, which is supported by both theoretical and empirical evidence. Based on this observation, we propose injecting scale equalizers to achieve scale equilibrium across multi-level features after bilinear upsampling. Our proposed scale equalizers are easy to implement, applicable to any architecture, hyperparameter-free, implementable without requiring extra computational cost, and guarantee scale equilibrium for any dataset. Experiments showed that adopting scale equalizers consistently improved the mIoU index across various target datasets, including ADE20K, PASCAL VOC 2012, and Cityscapes, as well as various decoder choices, including UPerHead, PSPHead, ASPPHead, SepASPPHead, and FCNHead.

Submitted 2 February, 2024; originally announced February 2024.

Comments: 10 pages, 3 figures

arXiv:2401.11739

EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

Authors: Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim

Abstract: Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.

Submitted 22 January, 2024; originally announced January 2024.

Comments: ICLR 2024. Project page: https://kmcode1.github.io/Projects/EmerDiff/

arXiv:2312.13763

Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

Authors: Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, Karsten Kreis

Abstract: Text-guided diffusion models have revolutionized image and video generation and have also been successfully used for optimization-based 3D object synthesis. Here, we instead focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects using score distillation methods with an additional temporal dimension. Compared to previous work, we pursue a novel compositional generation-based approach, and combine text-to-image, text-to-video, and 3D-aware multiview diffusion models to provide feedback during 4D object optimization, thereby simultaneously enforcing temporal consistency, high-quality visual appearance and realistic geometry. Our method, called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with deformation fields as 4D representation. Crucial to AYG is a novel method to regularize the distribution of the moving 3D Gaussians and thereby stabilize the optimization and induce motion. We also propose a motion amplification mechanism as well as a new autoregressive synthesis scheme to generate and combine multiple 4D sequences for longer generation. These techniques allow us to synthesize vivid dynamic scenes, outperform previous work qualitatively and quantitatively and achieve state-of-the-art text-to-4D performance. Due to the Gaussian 4D representation, different 4D animations can be seamlessly combined, as we demonstrate. AYG opens up promising avenues for animation, simulation and digital content creation as well as synthetic data generation.

Submitted 3 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Project page: https://research.nvidia.com/labs/toronto-ai/AlignYourGaussians/

arXiv:2311.13570

WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space

Authors: Katja Schwarz, Seung Wook Kim, Jun Gao, Sanja Fidler, Andreas Geiger, Karsten Kreis

Abstract: Modern learning-based approaches to 3D-aware image synthesis achieve high photorealism and 3D-consistent viewpoint changes for the generated images. Existing approaches represent instances in a shared canonical space. However, for in-the-wild datasets a shared canonical system can be difficult to define or might not even exist. In this work, we instead model instances in view space, alleviating the need for posed images and learned camera distributions. We find that in this setting, existing GAN-based methods are prone to generating flat geometry and struggle with distribution coverage. We hence propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs). We first train an autoencoder that infers a compressed latent representation, which additionally captures the images' underlying 3D structure and enables not only reconstruction but also novel view synthesis. To learn a faithful 3D representation, we leverage cues from monocular depth prediction. Then, we train a diffusion model in the 3D-aware latent space, thereby enabling synthesis of high-quality 3D-consistent image samples, outperforming recent state-of-the-art GAN-based methods. Importantly, our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry and does not require posed images or learned pose or camera distributions. It directly learns a 3D representation without relying on canonical camera coordinates. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data. See https://katjaschwarz.github.io/wildfusion for videos of our 3D results.

Submitted 12 April, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

arXiv:2311.03938

Analysis of NaN Divergence in Training Monocular Depth Estimation Model

Authors: Bum Jun Kim, Hyeonah Jang, Sang Woo Kim

Abstract: The latest advances in deep learning have facilitated the development of highly accurate monocular depth estimation models. However, when training a monocular depth estimation network, practitioners and researchers have observed not a number (NaN) loss, which disrupts gradient descent optimization. Although several practitioners have reported the stochastic and mysterious occurrence of NaN loss that bothers training, its root cause is not discussed in the literature. This study conducted an in-depth analysis of NaN loss during training a monocular depth estimation network and identified three types of vulnerabilities that cause NaN loss: 1) the use of square root loss, which leads to an unstable gradient; 2) the log-sigmoid function, which exhibits numerical stability issues; and 3) certain variance implementations, which yield incorrect computations. Furthermore, for each vulnerability, the occurrence of NaN loss was demonstrated and practical guidelines to prevent NaN loss were presented. Experiments showed that both optimization stability and performance on monocular depth estimation could be improved by following our guidelines.

Submitted 7 November, 2023; originally announced November 2023.

Comments: 10 pages, 3 figures
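
Two of the three failure modes named in the entry above (arXiv:2311.03938), the log-sigmoid function and variance computation, are general floating-point pitfalls that can be reproduced with the standard library alone. The sketch below is not the paper's code or its recommended fix. All four function names are hypothetical, and the "stable" variants are the usual textbook rewrites, assumed here for illustration.

```python
import math


def log_sigmoid_naive(x: float) -> float:
    # Direct translation of log(1 / (1 + e^-x)); e^-x overflows for very negative x.
    return math.log(1.0 / (1.0 + math.exp(-x)))


def log_sigmoid_stable(x: float) -> float:
    # Equivalent form min(x, 0) - log1p(e^-|x|); the exponent is never positive.
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def variance_naive(values):
    # E[x^2] - E[x]^2: suffers from cancellation when the mean is large.
    n = len(values)
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean * mean


def variance_two_pass(values):
    # Subtract the mean first, then average the squared deviations.
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


if __name__ == "__main__":
    try:
        log_sigmoid_naive(-800.0)
    except OverflowError as err:
        print("naive log-sigmoid failed:", err)
    print("stable log-sigmoid:", log_sigmoid_stable(-800.0))

    data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
    print("naive variance:   ", variance_naive(data))
    print("two-pass variance:", variance_two_pass(data))
```

In pure Python the naive log-sigmoid raises `OverflowError`. Array libraries typically return `inf` or `-inf` instead, which can then propagate into NaN values. The two-pass variance of the sample data is 22.5, while the naive formula loses precision because it subtracts two nearly equal numbers on the order of 1e18.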

arXiv:2311.02077

EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

Authors: Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, Yue Wang

Abstract: We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.

Submitted 3 November, 2023; originally announced November 2023.

Comments: See the project page for code, data, and request pre-trained models: https://emernerf.github.io
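
To put the PSNR gains quoted in the entry above (arXiv:2311.02077) in perspective, the standard PSNR definition can be inverted to give the implied change in mean squared error. This is a minimal sketch of that textbook formula, not the paper's evaluation code. The function names are hypothetical, and it assumes both methods are measured against the same peak value.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0.0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    for gain in (2.93, 3.70):
        ratio = mse_ratio_for_gain(gain)
        print(f"+{gain:.2f} dB -> MSE multiplied by {ratio:.3f}")
```

Because PSNR is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.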

arXiv:2307.14179

Resolution-Aware Design of Atrous Rates for Semantic Segmentation Networks

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim

Abstract: DeepLab is a widely used deep neural network for semantic segmentation, whose success is attributed to its parallel architecture called atrous spatial pyramid pooling (ASPP). ASPP uses multiple atrous convolutions with different atrous rates to extract both local and global information. However, fixed values of atrous rates are used for the ASPP module, which restricts the size of its field of view. In principle, atrous rate should be a hyperparameter to change the field of view size according to the target task or dataset. However, the manipulation of atrous rate is not governed by any guidelines. This study proposes practical guidelines for obtaining an optimal atrous rate. First, an effective receptive field for semantic segmentation is introduced to analyze the inner behavior of segmentation networks. We observed that the use of ASPP module yielded a specific pattern in the effective receptive field, which was traced to reveal the module's underlying mechanism. Accordingly, we derive practical guidelines for obtaining the optimal atrous rate, which should be controlled based on the size of input image. Compared to other values, using the optimal atrous rate consistently improved the segmentation results across multiple datasets, including the STARE, CHASE_DB1, HRF, Cityscapes, and iSAID datasets.

Submitted 26 July, 2023; originally announced July 2023.

Comments: 18 pages, 12 figures
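
The link between atrous rate and field of view mentioned in the entry above (arXiv:2307.14179) follows from the geometry of a dilated kernel. The sketch below computes the span of a single atrous convolution. It does not reproduce the paper's guideline for choosing rates, the function name is hypothetical, and the rates 6, 12, and 18 are assumed here as the commonly cited ASPP defaults.

```python
def effective_kernel_size(kernel_size: int, atrous_rate: int) -> int:
    """Span, in pixels, covered by one atrous (dilated) convolution kernel."""
    if kernel_size < 1 or atrous_rate < 1:
        raise ValueError("kernel_size and atrous_rate must be >= 1")
    # A rate of r inserts r - 1 gaps between each pair of adjacent taps.
    return kernel_size + (kernel_size - 1) * (atrous_rate - 1)


if __name__ == "__main__":
    for rate in (1, 6, 12, 18):
        span = effective_kernel_size(3, rate)
        print(f"3x3 kernel, rate {rate:2d} -> spans {span} x {span} pixels")
```

A 3x3 kernel at rates 6, 12, and 18 spans 13, 25, and 37 pixels. This is why a fixed set of rates covers a fixed field of view regardless of the input resolution.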

arXiv:2307.07487

DreamTeacher: Pretraining Image Backbones with Deep Generative Models

Authors: Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler

Abstract: In this work, we introduce a self-supervised feature representation learning framework DreamTeacher that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto logits of target backbones. We perform extensive analyses on multiple generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.

Submitted 14 July, 2023; originally announced July 2023.

Comments: Project page: https://research.nvidia.com/labs/toronto-ai/DreamTeacher/

arXiv:2305.04722

Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim

Abstract: Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks. Because partitioning patches eliminates the image structure, to reflect the order of patches, ViTs utilize an explicit component called positional embedding. However, we claim that the use of positional embedding does not simply guarantee the order-awareness of ViT. To support this claim, we analyze the actual behavior of ViTs using an effective receptive field. We demonstrate that during training, ViT acquires an understanding of patch order from the positional embedding that is trained to be a specific pattern. Based on this observation, we propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training. We evaluated the influence of Gaussian attention bias on the performance of ViTs in several image classification, object detection, and semantic segmentation experiments. The results showed that the proposed method not only helps ViTs to understand images but also boosts their performance on various datasets, including ImageNet, COCO 2017, and ADE20K.

Submitted 8 May, 2023; originally announced May 2023.

Comments: 11 pages, 7 figures

arXiv:2304.09787

NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

Authors: Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler

Abstract: Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.

Submitted 19 April, 2023; originally announced April 2023.

Comments: CVPR 2023

arXiv:2304.08818 [pdf, other]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Authors: Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

Abstract: Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation.
Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

Submitted 27 December, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2023. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

arXiv:2302.06112 [pdf, other]

How to Use Dropout Correctly on Residual Networks with Batch Normalization

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Donggeon Lee, Sang Woo Kim

Abstract: For the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioners. In this study, we investigate the correct position to apply dropout. We demonstrate that for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying dropout at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position to apply dropout: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models.

Submitted 13 February, 2023; originally announced February 2023.

Comments: 10 pages, 4 figures
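
The head-placement claim in the abstract above (dropout before global average pooling gives a more stable output than dropout after it) can be illustrated with a toy Monte Carlo experiment. This is a minimal sketch, not the authors' proof or code: the constant 64-element feature map, the keep probability of 0.5, and element-wise inverted dropout are all assumptions chosen for illustration.

```python
import random
import statistics


def inverted_dropout(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob and rescale survivors."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


def pooled_output_variance(features, keep_prob, dropout_first, trials=2000, seed=0):
    """Monte Carlo variance of the pooled scalar under the two orderings."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        if dropout_first:
            # Dropout on every element, then global average pooling.
            dropped = inverted_dropout(features, keep_prob, rng)
            outputs.append(sum(dropped) / len(dropped))
        else:
            # Global average pooling, then dropout on the single pooled value.
            pooled = sum(features) / len(features)
            outputs.append(inverted_dropout([pooled], keep_prob, rng)[0])
    return statistics.pvariance(outputs)


features = [1.0] * 64  # hypothetical 8x8 single-channel feature map
var_before = pooled_output_variance(features, keep_prob=0.5, dropout_first=True)
var_after = pooled_output_variance(features, keep_prob=0.5, dropout_first=False)
print(var_before, var_after)
```

In this toy setting, averaging over 64 independently dropped elements shrinks the variance of the pooled value by roughly a factor of 64 relative to dropping the pooled value itself, which is one intuition for why the earlier position is more stable.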

arXiv:2302.03193 [pdf, other]

On the Ideal Number of Groups for Isometric Gradient Propagation

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim

Abstract: Recently, various normalization layers have been proposed to stabilize the training of deep neural networks. Among them, group normalization is a generalization of layer normalization and instance normalization by allowing a degree of freedom in the number of groups it uses. However, to determine the optimal number of groups, trial-and-error-based hyperparameter tuning is required, and such experiments are time-consuming. In this study, we discuss a reasonable method for setting the number of groups. First, we find that the number of groups influences the gradient behavior of the group normalization layer. Based on this observation, we derive the ideal number of groups, which calibrates the gradient scale to facilitate gradient descent optimization. Our proposed number of groups is theoretically grounded, architecture-aware, and can provide a proper value in a layer-wise manner for all layers. The proposed method exhibited improved performance over existing methods in numerous neural network architectures, tasks, and datasets.

Submitted 6 February, 2023; originally announced February 2023.

Comments: 10 pages, 2 figures
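
The statement in the abstract above that group normalization generalizes layer and instance normalization through the number of groups can be made concrete with a small pure-Python sketch. It covers only the normalization statistics for a single sample; the learnable scale and shift are omitted, and the function name and sample values are hypothetical. It does not implement the paper's rule for choosing the number of groups.

```python
import math


def group_norm(channels, num_groups, eps=1e-5):
    """Normalize a list of channels (each a list of floats) group by group."""
    num_channels = len(channels)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    group_size = num_channels // num_groups
    out = []
    for start in range(0, num_channels, group_size):
        group = channels[start:start + group_size]
        # Statistics are shared by all channels and positions in the group.
        flat = [v for ch in group for v in ch]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out


x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]  # 4 channels, 2 positions
layer_like = group_norm(x, num_groups=1)     # one group: layer-norm statistics
instance_like = group_norm(x, num_groups=4)  # one channel per group: instance-norm statistics
two_groups = group_norm(x, num_groups=2)     # an intermediate setting
```

With `num_groups=1` the mean and variance are computed over all channels at once, and with `num_groups` equal to the channel count each channel is normalized on its own; every value in between is the degree of freedom the abstract refers to.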

arXiv:2208.12544 [pdf]

doi 10.1016/j.combustflame.2022.112583

Deep learning-based denoising for fast time-resolved flame emission spectroscopy in high-pressure combustion environment

Authors: Taekeun Yoon, Seon Woong Kim, Hosung Byun, Younsik Kim, Campbell D. Carter, Hyungrok Do

Abstract: A deep learning strategy is developed for fast and accurate gas property measurements using flame emission spectroscopy (FES). Particularly, the short-gated fast FES is essential to resolve fast-evolving combustion behaviors. However, as the exposure time for capturing the flame emission spectrum gets shorter, the signal-to-noise ratio (SNR) decreases, and characteristic spectral features indicating the gas properties become relatively weaker. Then, the property estimation based on the short-gated spectrum is difficult and inaccurate. Denoising convolutional neural networks (CNN) can enhance the SNR of the short-gated spectrum. A new CNN architecture including a reversible down- and up-sampling (DU) operator and a loss function based on proper orthogonal decomposition (POD) coefficients is proposed. For training and testing the CNN, flame chemiluminescence spectra were captured from a stable methane-air flat flame using a portable spectrometer (spectral range: 250 - 850 nm, resolution: 0.5 nm) with varied equivalence ratio (0.8 - 1.2), pressure (1 - 10 bar), and exposure time (0.05, 0.2, 0.4, and 2 s). The long exposure (2 s) spectra were used as the ground truth when training the denoising CNN. A kriging model with POD is trained on the long-gated spectra for calibration and then predicts the gas properties, taking the denoised short-gated spectrum as the input. The property prediction errors of pressure and equivalence ratio were remarkably lowered in spite of the low SNR attendant with reduced exposure.

Submitted 26 December, 2022; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: 25 pages, 12 figures, accepted to Combustion and Flame

Report number: Combustion and Flame 248 (2023) 112583

arXiv:2206.02903 [pdf, other]

Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps

Authors: Seung Wook Kim, Karsten Kreis, Daiqing Li, Antonio Torralba, Sanja Fidler

Abstract: Modern image generative models show remarkable sample quality when trained on a single domain or class of objects. In this work, we introduce a generative adversarial network that can simultaneously generate aligned image samples from multiple related domains. We leverage the fact that a variety of object classes share common attributes, with certain geometric differences. We propose Polymorphic-GAN which learns shared features across all domains and a per-domain morph layer to morph shared features according to each domain. In contrast to previous works, our framework allows simultaneous modelling of images with highly varying geometries, such as images of human faces, painted and artistic faces, as well as multiple different animal faces. We demonstrate that our model produces aligned samples for all domains and show how it can be used for applications such as segmentation transfer and cross-domain image editing, as well as training in low-data regimes. Additionally, we apply our Polymorphic-GAN on image-to-image translation tasks and show that we can greatly surpass previous approaches in cases where the geometric differences between domains are large.

Submitted 6 June, 2022; originally announced June 2022.

Comments: CVPR 2022 Oral

arXiv:2205.07260 [pdf, other]

Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

Abstract: L2 regularization for weights in neural networks is widely used as a standard training trick. However, L2 regularization for gamma, a trainable parameter of batch normalization, remains an undiscussed mystery and is applied in different ways depending on the library and practitioner. In this paper, we study whether L2 regularization for gamma is valid. To explore this issue, we consider two approaches: 1) variance control to make the residual network behave like identity mapping and 2) stable optimization through the improvement of effective learning rate. Through two analyses, we specify the desirable and undesirable gamma to apply L2 regularization and propose four guidelines for managing them. In several experiments, we observed the increase and decrease in performance caused by applying L2 regularization to gamma of four categories, which is consistent with our four guidelines. Our proposed guidelines were validated through various tasks and architectures, including variants of residual networks and transformers.

Submitted 15 May, 2022; originally announced May 2022.

Comments: 12 pages, 6 figures

arXiv:2201.04684 [pdf, other]

BigDatasetGAN: Synthesizing ImageNet with Pixel-wise Annotations

Authors: Daiqing Li, Huan Ling, Seung Wook Kim, Karsten Kreis, Adela Barriuso, Sanja Fidler, Antonio Torralba

Abstract: Annotating images with pixel-wise labels is a time-consuming and costly process. Recently, DatasetGAN showcased a promising alternative - to synthesize a large labeled dataset via a generative adversarial network (GAN) by exploiting a small set of manually labeled, GAN-generated images. Here, we scale DatasetGAN to ImageNet scale of class diversity. We take image samples from the class-conditional generative model BigGAN trained on ImageNet, and manually annotate 5 images per class, for all 1k classes. By training an effective feature segmentation architecture on top of BigGAN, we turn BigGAN into a labeled dataset generator. We further show that VQGAN can similarly serve as a dataset generator, leveraging the already annotated data. We create a new ImageNet benchmark by labeling an additional set of 8k real images and evaluate segmentation performance in a variety of settings. Through an extensive ablation study we show big gains in leveraging a large generated dataset to train different supervised and self-supervised backbone models on pixel-wise tasks. Furthermore, we demonstrate that using our synthesized datasets for pre-training leads to improvements over standard ImageNet pre-training on several downstream datasets, such as PASCAL-VOC, MS-COCO, Cityscapes and chest X-ray, as well as tasks (detection, segmentation). Our benchmark will be made public and maintain a leaderboard for this challenging task. Project Page: https://nv-tlabs.github.io/big-datasetgan/

Submitted 12 January, 2022; originally announced January 2022.

Comments: https://nv-tlabs.github.io/big-datasetgan/

arXiv:2111.08413 [pdf, other]

Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

Abstract: Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs). Meanwhile, since ViT has a different architecture than CNN, it may behave differently. To investigate the reliability of ViT, this paper studies the behavior and robustness of ViT. We compared the robustness of CNN and ViT by assuming various image corruptions that may appear in practical vision tasks. We confirmed that for most image transformations, ViT showed robustness comparable to or better than that of CNN. However, for contrast enhancement, severe performance degradations were consistently observed in ViT. From a detailed analysis, we identified a potential problem: positional embedding in ViT's patch embedding could work improperly when the color scale changes. Here we claim the use of PreLayerNorm, a modified patch embedding structure to ensure scale-invariant behavior of ViT. ViT with PreLayerNorm showed improved robustness in various corruptions including contrast-varying environments.

Submitted 16 November, 2021; originally announced November 2021.

Comments: 7 pages, 8 figures. Work in Progress
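
The scale-invariance idea behind PreLayerNorm in the abstract above can be sketched in a few lines of pure Python. The abstract does not spell out the exact structure, so this sketch assumes, from the name only, that layer normalization is applied to each flattened patch before the linear projection. It also models a contrast change as a global multiplicative scaling, and the patch values and weights are hypothetical.

```python
import math


def layer_norm(values, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def linear(values, weights):
    """Project a flattened patch with an (out_dim x in_dim) weight matrix."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


patch = [0.1, 0.4, 0.2, 0.9]        # hypothetical flattened 2x2 patch
weights = [[0.5, -0.2, 0.1, 0.3],   # hypothetical embedding weights
           [-0.4, 0.6, 0.2, -0.1]]
contrast = 3.0                      # assumed global intensity scaling
scaled_patch = [contrast * v for v in patch]

# Plain patch embedding: the output grows with the input scale.
plain = linear(patch, weights)
plain_scaled = linear(scaled_patch, weights)

# Layer norm first: the scale is divided out before the projection.
pre_ln = linear(layer_norm(patch), weights)
pre_ln_scaled = linear(layer_norm(scaled_patch), weights)
```

Under these assumptions the plain embedding is multiplied by the contrast factor, so a fixed-magnitude positional embedding added afterwards would carry relatively less weight, while the normalized embedding is unchanged up to the small `eps` term.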

arXiv:2111.03186 [pdf, other]

EditGAN: High-Precision Semantic Image Editing

Authors: Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, Sanja Fidler

Abstract: Generative adversarial networks (GANs) have recently found applications in image editing. However, most GAN based image editing methods often require large scale datasets with semantic segmentation annotations for training, only provide high level control, or merely interpolate between different images. Here, we propose EditGAN, a novel method for high quality, high precision semantic image editing, allowing users to edit images by modifying their highly detailed part segmentation masks, e.g., drawing a new mask for the headlight of a car. EditGAN builds on a GAN framework that jointly models images and their semantic segmentations, requiring only a handful of labeled examples, making it a scalable tool for editing. Specifically, we embed an image into the GAN latent space and perform conditional latent code optimization according to the segmentation edit, which effectively also modifies the image. To amortize optimization, we find editing vectors in latent space that realize the edits. The framework allows us to learn an arbitrary number of editing vectors, which can then be directly applied on other images at interactive rates. We experimentally show that EditGAN can manipulate images with an unprecedented level of detail and freedom, while preserving full image quality. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data. We demonstrate EditGAN on a wide variety of image types and quantitatively outperform several previous editing methods on standard editing benchmark tasks.

Submitted 4 November, 2021; originally announced November 2021.

arXiv:2108.13576 [pdf, other]

Dead Pixel Test Using Effective Receptive Field

Authors: Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

Abstract: Deep neural networks have been used in various fields, but their internal behavior is not well known. In this study, we discuss two counterintuitive behaviors of convolutional neural networks (CNNs). First, we evaluated the size of the receptive field. Previous studies have attempted to increase or control the size of the receptive field. However, we observed that the size of the receptive field does not describe the classification accuracy. The size of the receptive field would be inappropriate for representing superiority in performance because it reflects only depth or kernel size and does not reflect other factors such as width or cardinality. Second, using the effective receptive field, we examined the pixels contributing to the output. Intuitively, each pixel is expected to equally contribute to the final output. However, we found that there exist pixels in a partially dead state with little contribution to the output. We reveal that the reason for this lies in the architecture of CNN and discuss solutions to reduce the phenomenon. Interestingly, for general classification tasks, the existence of dead pixels improves the training of CNNs. However, in a task that captures small perturbation, dead pixels degrade the performance. Therefore, the existence of these dead pixels should be understood and considered in practical applications of CNN.

Submitted 30 August, 2021; originally announced August 2021.

Comments: 9 pages, 5 figures

arXiv:2104.15060 [pdf, other]

DriveGAN: Towards a Controllable High-Quality Neural Simulation

Authors: Seung Wook Kim, Jonah Philion, Antonio Torralba, Sanja Fidler

Abstract: Realistic simulators are critical for training and verifying robotics systems. While most of the contemporary simulators are hand-crafted, a scalable way to build simulators is to use machine learning to learn how the environment behaves in response to an action, directly from data. In this work, we aim to learn to simulate a dynamic environment directly in pixel-space, by watching unannotated sequences of frames and their associated action pairs. We introduce a novel high-quality neural simulator referred to as DriveGAN that achieves controllability by disentangling different components without supervision. In addition to steering controls, it also includes controls for sampling features of a scene, such as the weather as well as the location of non-player objects. Since DriveGAN is a fully differentiable simulator, it further allows for re-simulation of a given video sequence, offering an agent to drive through a recorded scene again, possibly taking different actions. We train DriveGAN on multiple datasets, including 160 hours of real-world driving data. We showcase that our approach greatly surpasses the performance of previous data-driven simulators, and allows for new features not explored before.

Submitted 30 April, 2021; originally announced April 2021.

Comments: CVPR 2021 Oral

arXiv:2102.03897 [pdf, other]

doi 10.1016/j.media.2021.102256

Self-supervised driven consistency training for annotation efficient histopathology image analysis

Authors: Chetan L. Srinidhi, Seung Wook Kim, Fu-Der Chen, Anne L. Martel

Abstract: Training a neural network with a large labeled dataset is still a dominant paradigm in computational histopathology. However, obtaining such exhaustive manual annotations is often expensive, laborious, and prone to inter- and intra-observer variability. While recent self-supervised and semi-supervised methods can alleviate this need by learning unsupervised feature representations, they still struggle to generalize well to downstream tasks when the number of labeled instances is small. In this work, we overcome this challenge by leveraging both task-agnostic and task-specific unlabeled data based on two novel strategies: i) a self-supervised pretext task that harnesses the underlying multi-resolution contextual cues in histology whole-slide images to learn a powerful supervisory signal for unsupervised representation learning; ii) a new teacher-student semi-supervised consistency paradigm that learns to effectively transfer the pretrained representations to downstream tasks based on prediction consistency with the task-specific unlabeled data. We carry out extensive validation experiments on three histopathology benchmark datasets across two classification and one regression-based tasks, i.e., tumor metastasis detection, tissue type classification, and tumor cellularity quantification. Under limited-label data, the proposed method yields tangible improvements, performing close to or even outperforming other state-of-the-art self-supervised and supervised baselines.
Furthermore, we empirically show that the idea of bootstrapping the self-supervised pretrained features is an effective way to improve the task-specific semi-supervised learning on standard benchmarks. Code and pretrained models will be made available at: https://github.com/srinidhiPY/SSL_CR_Histo

Submitted 3 October, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

Journal ref: Medical Image Analysis, Volume 75, January 2022

arXiv:2008.07083 [pdf, other]

Edge Network-Assisted Real-Time Object Detection Framework for Autonomous Driving

Authors: Seung Wook Kim, Keunsoo Ko, Haneul Ko, Victor C. M. Leung

Abstract: Autonomous vehicles (AVs) can achieve the desired results within a short duration by offloading tasks even requiring high computational power (e.g., object detection (OD)) to edge clouds. However, although edge clouds are exploited, real-time OD cannot always be guaranteed due to dynamic channel quality. To mitigate this problem, we propose an edge network-assisted real-time OD framework (EODF). In an EODF, AVs extract the regions of interest (RoIs) of the captured image when the channel quality is not sufficiently good for supporting real-time OD. Then, AVs compress the image data on the basis of the RoIs and transmit the compressed one to the edge cloud. In so doing, real-time OD can be achieved owing to the reduced transmission latency. To verify the feasibility of our framework, we evaluate the probability that the results of OD are not received within the inter-frame duration (i.e., outage probability) and their accuracy. From the evaluation, we demonstrate that the proposed EODF provides the results to AVs in real-time and achieves satisfactory accuracy.

Submitted 17 August, 2020; originally announced August 2020.

Comments: This paper will be published in IEEE Network
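
The outage probability defined in the abstract above (the probability that object detection results are not received within the inter-frame duration) can be estimated empirically from measured latencies. The sketch below is only an illustration of that definition: the latency samples and the 30 fps frame rate are made-up assumptions, and it does not model the channel or the RoI-based compression of the framework.

```python
def outage_probability(latencies_ms, inter_frame_ms):
    """Fraction of frames whose result arrives after the inter-frame duration."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    late = sum(1 for t in latencies_ms if t > inter_frame_ms)
    return late / len(latencies_ms)


# Hypothetical end-to-end latencies (transmission plus inference), in milliseconds.
latencies = [21.0, 35.5, 28.0, 40.2, 30.1, 33.3, 19.8, 45.0]
inter_frame = 1000.0 / 30.0  # about 33.3 ms for an assumed 30 fps camera
p_out = outage_probability(latencies, inter_frame)
print(p_out)
```

Three of the eight hypothetical samples exceed the inter-frame duration, so the estimate here is 3/8; reducing transmission latency, as the abstract describes, is what would push this fraction down.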

arXiv:2005.12126 [pdf, other]

Learning to Simulate Dynamic Environments with GameGAN

Authors: Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, Sanja Fidler

Abstract: Simulation is a crucial component of any robotic system. In order to simulate correctly, we need to write complex rules of the environment: how dynamic agents behave, and how the actions of each of the agents affect the behavior of others. In this paper, we aim to learn a simulator by simply watching an agent interact with an environment. We focus on graphics games as a proxy of the real environment. We introduce GameGAN, a generative model that learns to visually imitate a desired game by ingesting screenplay and keyboard actions during training. Given a key pressed by the agent, GameGAN "renders" the next screen using a carefully designed generative adversarial network. Our approach offers key advantages over existing work: we design a memory module that builds an internal map of the environment, allowing for the agent to return to previously visited locations with high visual consistency. In addition, GameGAN is able to disentangle static and dynamic components within an image making the behavior of the model more interpretable, and relevant for downstream tasks that require explicit reasoning over dynamic elements. This enables many interesting applications such as swapping different components of the game to build new games that do not exist.

Submitted 25 May, 2020; originally announced May 2020.

Comments: CVPR 2020

arXiv:2001.05153 [pdf, other]

Extending Class Activation Mapping Using Gaussian Receptive Field

Authors: Bum Jun Kim, Gyogwon Koo, Hyeyeon Choi, Sang Woo Kim

Abstract: This paper addresses the visualization task of deep learning models. To improve Class Activation Mapping (CAM) based visualization method, we offer two options. First, we propose Gaussian upsampling, an improved upsampling method that can reflect the characteristics of deep learning models. Second, we identify and modify unnatural terms in the mathematical derivation of the existing CAM studies. Based on two options, we propose Extended-CAM, an advanced CAM-based visualization method, which exhibits improved theoretical properties. Experimental results show that Extended-CAM provides more accurate visualization than the existing methods.

Submitted 15 January, 2020; originally announced January 2020.

Comments: 7 pages, 5 figures

arXiv:1912.13082 [pdf, other]

The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries

Authors: Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler

Abstract: Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories. We then show that the chronological alignment provides a strong supervisory signal that learning-based methods can exploit leading to significant improvements on these tasks. We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable.

Submitted 1 January, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

Comments: Project page: http://www.cs.toronto.edu/~makarand/shmoop/ Dataset at: https://github.com/achaudhury/shmoop-corpus/

arXiv:1810.04029 [pdf, other]

doi 10.1109/ACCESS.2019.2899109

Selective Distillation of Weakly Annotated GTD for Vision-based Slab Identification System

Authors: Sang Jun Lee, Sang Woo Kim, Wookyong Kwon, Gyogwon Koo, Jong Pil Yun

Abstract: This paper proposes an algorithm for recognizing slab identification numbers in factory scenes. In the development of a deep-learning based system, manual labeling to make ground truth data (GTD) is an important but expensive task. Furthermore, the quality of GTD is closely related to the performance of a supervised learning algorithm. To reduce manual work in the labeling process, we generated weakly annotated GTD by marking only character centroids. Whereas bounding boxes for characters require at least a drag-and-drop operation or two clicks to annotate a character location, the weakly annotated GTD requires a single click to record a character location. The main contribution of this paper is selective distillation to improve the quality of the weakly annotated GTD. Because manual GTD are usually generated by many people, they may contain personal bias or human error. To address this problem, the information in manual GTD is integrated and refined by selective distillation. In the process of selective distillation, a fully convolutional network is trained using the weakly annotated GTD, and its prediction maps are selectively used to revise locations and boundaries of semantic regions of characters in the initial GTD. The modified GTD are used in the main training stage, and post-processing is conducted to retrieve text information. Experiments were thoroughly conducted on actual industry data collected at a steelmaking factory to demonstrate the effectiveness of the proposed method.

Submitted 13 December, 2018; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: 10 pages, 12 figures, submitted to a journal

Journal ref: IEEE Access 7 (2019) 23177-23186

arXiv:1810.01616 [pdf, other]

Cascaded Pyramid Network for 3D Human Pose Estimation Challenge

Authors: Sungeun Hong, Won** Jung, Ilsang Woo, Seung Wook Kim

Abstract: Over the past decade, there has been a growing interest in human pose estimation. Although much work has been done on 2D pose estimation, 3D pose estimation has been studied relatively less. In this paper, we propose a top-bottom based two-stage 3D estimation framework. GlobalNet and RefineNet in our 2D pose estimation process enable us to find occluded or invisible 2D joints, while a 2D-to-3D pose estimator composed of residual blocks is used to lift 2D joints to 3D joints effectively. The proposed method achieves promising results, with a mean per joint position error of 42.39 on the validation dataset of "3D Human Pose Estimation" within the ECCV 2018 PoseTrack Challenge.

Submitted 3 October, 2018; originally announced October 2018.

Comments: Accepted to ECCV Workshop 2018

arXiv:1806.02453 [pdf, other]

Visual Reasoning by Progressive Module Networks

Authors: Seung Wook Kim, Makarand Tapaswi, Sanja Fidler

Abstract: Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn - most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline.

Submitted 27 September, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

Comments: 17 pages, 5 figures

arXiv:1804.03533 [pdf, ps, other]

Multi-band RF Energy and Spectrum Harvesting in Cognitive Radio Networks

Authors: Ahmad Alsharoa, Nathan M Neihart, Sang W Kim, Ahmed E Kamal

Abstract: This paper investigates multi-band energy harvesting (EH) schemes under the cognitive radio interweave framework. All secondary users (SUs) are considered as EH nodes that are allowed to harvest energy from multiple bands of Radio Frequency (RF) sources. A win-win framework is proposed, where SUs can sense the spectrum to determine whether the spectrum is busy, and hence they may harvest RF energy from it, or whether it is idle, and hence they can use it for transmission. Only a subset of the SUs can sense in order to reduce sensing energy, and then machine learning is used to characterize areas of harvesting and spectrum usage. We formulate an optimization problem that jointly optimizes the number of sensing samples and the sensing threshold in order to minimize the sensing time and hence maximize the amount of energy harvested. A near-optimal solution is proposed using Geometric Programming (GP) to solve the problem in a time-slotted period. Finally, an energy-efficient approach based on multi-class Support Vector Machine (SVM) is proposed by involving only training SUs instead of all SUs.

Submitted 10 April, 2018; originally announced April 2018.

arXiv:1403.6555 [pdf, ps, other]

Modify-and-Forward for Securing Cooperative Relay Communications

Authors: Sang Wu Kim

Abstract: We propose a new physical layer technique that can enhance the security of cooperative relay communications. The proposed approach modifies the decoded message at the relay according to the unique channel state between the relay and the destination such that the destination can utilize the modified message to its advantage while the eavesdropper cannot. We present a practical method for securely sharing the modification rule between the legitimate partners and present the secrecy outage probability in a quasi-static fading channel. It is demonstrated that the proposed scheme can provide a significant improvement over other schemes when the relay can successfully decode the source message.

Submitted 25 March, 2014; originally announced March 2014.

Comments: IEEE International Zurich Seminar on Communications, Feb. 2014

Showing 1–37 of 37 results for author: Kim, S W