Search | arXiv e-print repository

Motion-Conditioned Image Animation for Video Editing

Authors: Wilson Yan, Andrew Brown, Pieter Abbeel, Rohit Girdhar, Samaneh Azadi

Abstract: We introduce MoCA, a Motion-Conditioned Image Animation approach for video editing. It leverages a simple decomposition of the video editing problem into image editing followed by motion-conditioned image animation. Furthermore, given the lack of robust evaluation datasets for video editing, we introduce a new benchmark that measures edit capability across a wide variety of tasks, such as object replacement, background changes, style changes, and motion edits. We present a comprehensive human evaluation of the latest video editing methods along with MoCA, on our proposed benchmark. MoCA establishes a new state-of-the-art, demonstrating greater human preference win-rate, and outperforming notable recent approaches including Dreamix (63%), MasaCtrl (75%), and Tune-A-Video (72%), with especially significant improvements for motion edits.

Submitted 30 November, 2023; originally announced November 2023.

Comments: Project page: https://facebookresearch.github.io/MoCA

arXiv:2311.10709 [pdf, other]

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

Authors: Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, Ishan Misra

Abstract: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions--adjusted noise schedules for diffusion, and multi-stage training--that enable us to directly generate high quality and high resolution videos, without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality compared to all prior work--81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorizing approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% over prior work.

Submitted 17 November, 2023; originally announced November 2023.

Comments: Project page: https://emu-video.metademolab.com

arXiv:2310.09243 [pdf, other]

Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design

Authors: Pirouz Nourian, Shervin Azadi, Roy Uijtendaal, Nan Bai

Abstract: This chapter presents methodological reflections on the necessity and utility of artificial intelligence in generative design. Specifically, the chapter discusses how generative design processes can be augmented by AI to deliver in terms of a few outcomes of interest or performance indicators while dealing with hundreds or thousands of small decisions. The core of the performance-based generative design paradigm is about making statistical or simulation-driven associations between these choices and consequences for mapping and navigating such a complex decision space. This chapter will discuss promising directions in Artificial Intelligence for augmenting decision-making processes in architectural design for mapping and navigating complex design spaces.

Submitted 13 October, 2023; originally announced October 2023.

Comments: This is the author's version of the book chapter Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design. In Artificial Intelligence in Performance-Driven Design: Theories, Methods, and Tools Towards Sustainability, edited by Narjes Abbasabadi and Mehdi Ashayeri. Wiley, 2023

arXiv:2309.15472 [pdf, other]

Voxel Graph Operators: Topological Voxelization, Graph Generation, and Derivation of Discrete Differential Operators from Voxel Complexes

Authors: Pirouz Nourian, Shervin Azadi

Abstract: In this paper, we present a novel workflow consisting of algebraic algorithms and data structures for fast and topologically accurate conversion of vector data models such as Boundary Representations into voxels (topological voxelization); spatially indexing them; constructing connectivity graphs from voxels; and constructing a coherent set of multivariate differential and integral operators from these graphs. Topological Voxelization is revisited and presented in the paper as a reversible mapping of geometric models from $\mathbb{R}^3$ to $\mathbb{Z}^3$ to $\mathbb{N}^3$ and eventually to an index space created by Morton Codes in $\mathbb{N}$ while ensuring the topological validity of the voxel models; namely their topological thinness and their geometrical consistency. In addition, we present algorithms for constructing graphs and hyper-graph connectivity models on voxel data for graph traversal and field interpolations and utilize them algebraically in elegantly discretizing differential and integral operators for geometric, graphical, or spatial analyses and digital simulations. The multi-variate differential and integral operators presented in this paper can be used particularly in the formulation of Partial Differential Equations for physics simulations.

Submitted 27 September, 2023; originally announced September 2023.

Comments: 23 pages

arXiv:2309.13396 [pdf, other]

EquiCity Game: A mathematical serious game for participatory design of spatial configurations

Authors: Pirouz Nourian, Shervin Azadi, Nan Bai, Bruno de Andrade, Nour Abu Zaid, Samaneh Rezvani, Ana Pereira Roders

Abstract: We propose mechanisms for a mathematical social-choice game that is designed to mediate decision-making processes for city planning, urban area redevelopment, and architectural design (massing) of urban housing complexes. The proposed game is effectively a multi-player generative configurator equipped with automated appraisal/scoring mechanisms for revealing the aggregate impact of alternatives; featuring a participatory digital process to support transparent and inclusive decision-making processes in spatial design for ensuring an equitable balance of sustainable development goals. As such, the game effectively empowers a group of decision-makers to reach a fair consensus by mathematically simulating many rounds of trade-offs between their decisions, with different levels of interest or control over various types of investments. Our proposed gamified design process encompasses decision-making about the most idiosyncratic aspects of a site related to its heritage status and cultural significance to the physical aspects such as balancing access to sunlight and the right to sunlight of the neighbours of the site, ensuring coherence of the entire configuration with regards to a network of desired closeness ratings, the satisfaction of a programme of requirements, and intricately balancing individual development goals in conjunction with communal goals and environmental design codes.
The game is developed fully based on an algebraic computational process on our own digital twinning platform, using open geospatial data and open-source computational tools such as NumPy. The mathematical process consists of a Markovian design machine for balancing the decisions of actors, a massing configurator equipped with Fuzzy Logic and Multi-Criteria Decision Analysis, algebraic graph-theoretical accessibility evaluators, and automated solar-climatic evaluators using geospatial computational geometry.

Submitted 30 September, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: 16 pages (the paper), 15 pages (supplemental materials), references missing in the supplemental document

arXiv:2305.09662 [pdf, other]

Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation

Authors: Samaneh Azadi, Akbar Shah, Thomas Hayes, Devi Parikh, Sonal Gupta

Abstract: Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.

Submitted 16 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2304.07410

arXiv:2304.07410 [pdf, other]

Text-Conditional Contextualized Avatars For Zero-Shot Personalization

Authors: Samaneh Azadi, Thomas Hayes, Akbar Shah, Guan Pang, Devi Parikh, Sonal Gupta

Abstract: Recent large-scale text-to-image generation models have made significant improvements in the quality, realism, and diversity of the synthesized images and enable users to control the created content through language. However, the personalization aspect of these generative models is still challenging and under-explored. In this work, we propose a pipeline that enables personalization of image generation with avatars capturing a user's identity in a delightful way. Our pipeline is zero-shot, avatar texture and style agnostic, and does not require training on the avatar at all - it is scalable to millions of users who can generate a scene with their avatar. To render the avatar in a pose faithful to the given text prompt, we propose a novel text-to-3D pose diffusion model trained on a curated large-scale dataset of in-the-wild human poses improving the performance of the SOTA text-to-motion models significantly. We show, for the first time, how to leverage large-scale image datasets to learn human 3D pose parameters and overcome the limitations of motion capture datasets.

Submitted 14 April, 2023; originally announced April 2023.

arXiv:2212.00210 [pdf, other]

Shape-Guided Diffusion with Inside-Outside Attention

Authors: Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell

Abstract: We introduce precise object silhouette as a new form of user control in text-to-image diffusion models, which we dub Shape-Guided Diffusion. Our training-free method uses an Inside-Outside Attention mechanism during the inversion and generation process to apply a shape constraint to the cross- and self-attention maps. Our mechanism designates which spatial region is the object (inside) vs. background (outside) then associates edits to the correct region. We demonstrate the efficacy of our method on the shape-guided editing task, where the model must replace an object according to a text prompt and object mask. We curate a new ShapePrompts benchmark derived from MS-COCO and achieve SOTA results in shape faithfulness without a degradation in text alignment or image realism according to both automatic metrics and annotator ratings. Our data and code will be made available at https://shape-guided-diffusion.github.io.

Submitted 1 April, 2024; v1 submitted 30 November, 2022; originally announced December 2022.

Comments: WACV 2024

arXiv:2202.05183 [pdf, other]

doi 10.1103/PhysRevLett.130.036401

Discovering Quantum Phase Transitions with Fermionic Neural Networks

Authors: G. Cassella, H. Sutterud, S. Azadi, N. D. Drummond, D. Pfau, J. S. Spencer, W. M. C. Foulkes

Abstract: Deep neural networks have been extremely successful as highly accurate wave function ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.

Submitted 5 July, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

Comments: 12 pages, 3 figures

arXiv:2112.05744 [pdf, other]

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Authors: Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell

Abstract: Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores, without re-training the diffusion model. We explore CLIP-based language guidance as well as both content and style-based image guidance in a unified framework. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content reference image, and examples with both textual and image guidance.

Submitted 5 December, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

Comments: WACV 2023. Project page https://xh-liu.github.io/sdg/

arXiv:2109.11037 [pdf, other]

A Computational Approach for Checking Compliance with European View and Sunlight Exposure Criteria

Authors: Eleonora Brembilla, Shervin Azadi, Pirouz Nourian

Abstract: The paper presents open-source computational workflows for assessing the "Exposure to sunlight" and "View out" criteria as defined in the European standard EN 17037 "Daylight in Buildings", issued by the European Committee for Standardization. In addition to these factors, the standard document also addresses daylight provision and protection from glare, both of which fall outside the scope of this paper. The purpose of the standard is stated as 'encouraging building designers to assess and ensure successfully daylit spaces'. The standard document proposes verification methods for performing such assessments, albeit without recommending a simulation procedure for computing the aforementioned criteria. The workflows proposed in this paper are arguably the first attempt to standardize these assessment methods using de-facto open-source standard technologies currently used in practice. The approach of this work is twofold: establishing that the compliance check can be systematically performed on a 3D model by a novel simulation tool developed by the authors; and highlighting the additional assumptions that need to be implemented to build a robust and unambiguous tool within existing open-source frameworks.

Submitted 21 September, 2021; originally announced September 2021.

Comments: 7 pages, 8 figures, accepted and presented in the 17th IBPSA Conference Bruges, Belgium, Sept. 1-3, 2021

ACM Class: J.6; J.2; I.3

arXiv:1911.11357 [pdf, other]

Semantic Bottleneck Scene Generation

Authors: Samaneh Azadi, Michael Tschannen, Eric Tzeng, Sylvain Gelly, Trevor Darrell, Mario Lucic

Abstract: Coupling the high-fidelity generation capabilities of label-conditional image synthesis methods with the flexibility of unconditional generative models, we propose a semantic bottleneck GAN model for unconditional synthesis of complex scenes. We assume pixel-wise segmentation labels are available during training and use them to learn the scene structure. During inference, our model first synthesizes a realistic segmentation layout from scratch, then synthesizes a realistic scene conditioned on that layout. For the former, we use an unconditional progressive segmentation generation network that captures the distribution of realistic semantic scene layouts. For the latter, we use a conditional segmentation-to-image synthesis network that captures the distribution of photo-realistic images conditioned on the semantic layout. When trained end-to-end, the resulting model outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains in terms of the Frechet Inception Distance and user-study evaluations. Moreover, we demonstrate the generated segmentation maps can be used as additional training data to strongly improve recent segmentation-to-image synthesis networks.

Submitted 26 November, 2019; originally announced November 2019.

arXiv:1907.03244 [pdf]

doi 10.22055/JACM.2022.40688.3675

Time Distance: A Novel Collision Prediction and Path Planning Method

Authors: Ali Analooee, Shahram Azadi, Reza Kazemi

Abstract: In this paper, a new fast algorithm for path planning and a collision prediction framework for two-dimensional dynamically changing environments are introduced. The method is called Time Distance (TD) and benefits from the space-time space idea. First, the TD concept is defined as the time interval that must be spent in order for an object to reach another object or a location. Next, TD functions are derived as a function of location, velocity and geometry of objects. To construct the configuration-time space, TD functions in conjunction with another function named "Z-Infinity" are exploited. Finally, an explicit formula for creating the length-optimal collision-free path is presented. Length optimization in this formula is achieved using a function named "Route Function" which minimizes a cost function. Performance of the path planning algorithm is evaluated in simulations. Comparisons indicate that the algorithm is fast enough and capable of generating length-optimal paths, as the most effective methods do. Finally, as another usage of the TD functions, a collision prediction framework is presented. This framework consists of an explicit function which is a function of TD functions and calculates the TD of the vehicle with respect to all objects of the environment.

Submitted 6 April, 2023; v1 submitted 7 July, 2019; originally announced July 2019.

Journal ref: Journal of Applied and Computational Mechanics, Vol. 9, No. 3, (2023), 656-677

arXiv:1810.06758 [pdf, other]

Discriminator Rejection Sampling

Authors: Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, Augustus Odena

Abstract: We propose a rejection sampling scheme using the discriminator of a GAN to approximately correct errors in the GAN generator distribution. We show that under quite strict assumptions, this will allow us to recover the data distribution exactly. We then examine where those strict assumptions break down and design a practical algorithm - called Discriminator Rejection Sampling (DRS) - that can be used on real data-sets. Finally, we demonstrate the efficacy of DRS on a mixture of Gaussians and on the SAGAN model, state-of-the-art in the image generation task at the time of developing this work. On ImageNet, we train an improved baseline that increases the Inception Score from 52.52 to 62.36 and reduces the Frechet Inception Distance from 18.65 to 14.79. We then use DRS to further improve on this baseline, improving the Inception Score to 76.08 and the FID to 13.75.

Submitted 26 February, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: Published as a conference paper at ICLR 2019

arXiv:1807.07560 [pdf, other]

Compositional GAN: Learning Image-Conditional Binary Composition

Authors: Samaneh Azadi, Deepak Pathak, Sayna Ebrahimi, Trevor Darrell

Abstract: Generative Adversarial Networks (GANs) can produce images of remarkable complexity and realism but are generally structured to sample from a single latent source ignoring the explicit spatial interaction between multiple entities that could be present in a scene. Capturing such complex interactions between different objects in the world, including their relative scaling, spatial layout, occlusion, or viewpoint transformation is a challenging problem. In this work, we propose a novel self-consistent Composition-by-Decomposition (CoDe) network to compose a pair of objects. Given object images from two distinct distributions, our model can generate a realistic composite image from their joint distribution following the texture and shape of the input objects. We evaluate our approach through qualitative experiments and user evaluations. Our results indicate that the learned model captures potential interactions between the two object domains, and generates realistic composed scenes at test time.

Submitted 28 March, 2019; v1 submitted 19 July, 2018; originally announced July 2018.

arXiv:1712.00516 [pdf, other]

Multi-Content GAN for Few-Shot Font Style Transfer

Authors: Samaneh Azadi, Matthew Fisher, Vladimir Kim, Zhaowen Wang, Eli Shechtman, Trevor Darrell

Abstract: In this work, we focus on the challenge of taking partial observations of highly-stylized text and generalizing the observations to generate unobserved glyphs in the ornamented typeface. To generate a set of multi-content images following a consistent style from very few examples, we propose an end-to-end stacked conditional GAN model considering content along channels and style along network layers. Our proposed network transfers the style of given glyphs to the contents of unseen ones, capturing highly stylized fonts found in the real world such as those on movie posters or infographics. We seek to transfer both the typographic stylization (e.g., serifs and ears) as well as the textual stylization (e.g., color gradients and effects). We base our experiments on our collected data set including 10,000 fonts with different styles and demonstrate effective generalization from a very small number of observed glyphs.

Submitted 1 December, 2017; originally announced December 2017.

arXiv:1704.03533 [pdf, other]

Learning Detection with Diverse Proposals

Authors: Samaneh Azadi, Jiashi Feng, Trevor Darrell

Abstract: To predict a set of diverse and informative proposals with enriched representations, this paper introduces a differentiable Determinantal Point Process (DPP) layer that is able to augment the object detection architectures. Most modern object detection architectures, such as Faster R-CNN, learn to localize objects by minimizing deviations from the ground-truth but ignore correlation between multiple proposals and object categories. Non-Maximum Suppression (NMS) as a widely used proposal pruning scheme ignores label- and instance-level relations between object candidates resulting in multi-labeled detections. In the multi-class case, NMS selects boxes with the largest prediction scores ignoring the semantic relation between categories of potential election. In contrast, our trainable DPP layer, allowing for Learning Detection with Diverse Proposals (LDDP), considers both label-level contextual information and spatial layout relationships between proposals without increasing the number of parameters of the network, and thus improves location and category specifications of final detected bounding boxes substantially during both training and inference schemes. Furthermore, we show that LDDP keeps its superiority over Faster R-CNN even if the number of proposals generated by LDDP is only ~30% as many as those for Faster R-CNN.

Submitted 11 April, 2017; originally announced April 2017.

Comments: Accepted to CVPR 2017

arXiv:1511.07069 [pdf, other]

Auxiliary Image Regularization for Deep CNNs with Noisy Labels

Authors: Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, Trevor Darrell

Abstract: Precisely-labeled data sets with sufficient amount of samples are very important for training deep convolutional neural networks (CNNs). However, many of the available real-world data sets contain erroneously labeled samples and those errors substantially hinder the learning of very accurate CNN models. In this work, we consider the problem of training a deep CNN model for image classification with mislabeled training samples - an issue that is common in real image data sets with tags supplied by amateur users. To solve this problem, we propose an auxiliary image regularization technique, optimized by the stochastic Alternating Direction Method of Multipliers (ADMM) algorithm, that automatically exploits the mutual context information among training images and encourages the model to select reliable images to robustify the learning process. Comprehensive experiments on benchmark data sets clearly demonstrate our proposed regularized CNN model is resistant to label noise in training data.

Submitted 2 March, 2016; v1 submitted 22 November, 2015; originally announced November 2015.

Comments: Published as a conference paper at ICLR 2016

Showing 1–18 of 18 results for author: Azadi, S