-
Semantic Pyramid for Image Generation
Authors:
Assaf Shocher,
Yossi Gandelsman,
Inbar Mosseri,
Michal Yarom,
Michal Irani,
William T. Freeman,
Tali Dekel
Abstract:
We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid -- a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.
Submitted 16 March, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.
-
Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization
Authors:
Miika Aittala,
Prafull Sharma,
Lukas Murmann,
Adam B. Yedidia,
Gregory W. Wornell,
William T. Freeman,
Fredo Durand
Abstract:
We recover a video of the motion taking place in a hidden scene by observing changes in indirect illumination in a nearby uncalibrated visible region. We solve this problem by factoring the observed video into a matrix product between the unknown hidden scene video and an unknown light transport matrix. This task is extremely ill-posed, as any non-negative factorization will satisfy the data. Inspired by recent work on the Deep Image Prior, we parameterize the factor matrices using randomly initialized convolutional neural networks trained in a one-off manner, and show that this results in decompositions that reflect the true motion in the hidden scene.
Submitted 4 December, 2019;
originally announced December 2019.
-
Program-Guided Image Manipulators
Authors:
Jiayuan Mao,
Xiuming Zhang,
Yikai Li,
William T. Freeman,
Joshua B. Tenenbaum,
Jiajun Wu
Abstract:
Humans are capable of building holistic representations for images at various levels, from local objects, to pairwise relations, to global structures. The interpretation of structures involves reasoning over repetition and symmetry of the objects in the image. In this paper, we present the Program-Guided Image Manipulator (PG-IM), inducing neuro-symbolic program-like representations to represent and manipulate images. Given an image, PG-IM detects repeated patterns, induces symbolic programs, and manipulates the image using a neural network that is guided by the program. PG-IM learns from a single image, exploiting its internal statistics. Despite being trained only on image inpainting, PG-IM is directly capable of extrapolation and regularity editing in a unified framework. Extensive experiments show that PG-IM achieves superior performance on all the tasks.
Submitted 4 September, 2019;
originally announced September 2019.
-
Visual Deprojection: Probabilistic Recovery of Collapsed Dimensions
Authors:
Guha Balakrishnan,
Adrian V. Dalca,
Amy Zhao,
John V. Guttag,
Fredo Durand,
William T. Freeman
Abstract:
We introduce visual deprojection: the task of recovering an image or video that has been collapsed along a dimension. Projections arise in various contexts, such as long-exposure photography, where a dynamic scene is collapsed in time to produce a motion-blurred image, and corner cameras, where reflected light from a scene is collapsed along a spatial dimension because of an edge occluder to yield a 1D video. Deprojection is ill-posed: often there are many plausible solutions for a given input. We first propose a probabilistic model capturing the ambiguity of the task. We then present a variational inference strategy using convolutional neural networks as functional approximators. Sampling from the inference network at test time yields plausible candidates from the distribution of original signals that are consistent with a given input projection. We evaluate the method on several datasets for both spatial and temporal deprojection tasks. We first demonstrate the method can recover human gait videos and face images from spatial projections, and then show that it can recover videos of moving digits from dramatically motion-blurred images obtained via temporal projection.
Submitted 1 September, 2019;
originally announced September 2019.
-
Boundless: Generative Adversarial Networks for Image Extension
Authors:
Piotr Teterwak,
Aaron Sarna,
Dilip Krishnan,
Aaron Maschinot,
David Belanger,
Ce Liu,
William T. Freeman
Abstract:
Image extension models have broad applications in image editing, computational photography and computer graphics. While image inpainting has been extensively studied in the literature, it is challenging to directly apply the state-of-the-art inpainting methods to image extension as they tend to generate blurry or repetitive pixels with inconsistent semantics. We introduce semantic conditioning to the discriminator of a generative adversarial network (GAN), and achieve strong results on image extension with coherent semantics and visually pleasing colors and textures. We also show promising results in extreme extensions, such as panorama generation.
Submitted 19 August, 2019;
originally announced August 2019.
-
Speech2Face: Learning the Face Behind a Voice
Authors:
Tae-Hyun Oh,
Tali Dekel,
Changil Kim,
Inbar Mosseri,
William T. Freeman,
Michael Rubinstein,
Wojciech Matusik
Abstract:
How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly. We evaluate and numerically quantify how--and in what manner--our Speech2Face reconstructions, obtained directly from audio, resemble the true face images of the speakers.
Submitted 23 May, 2019;
originally announced May 2019.
-
Learning the Depths of Moving People by Watching Frozen People
Authors:
Zhengqi Li,
Tali Dekel,
Forrester Cole,
Richard Tucker,
Noah Snavely,
Ce Liu,
William T. Freeman
Abstract:
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth.
Submitted 24 April, 2019;
originally announced April 2019.
-
Learning Shape Templates with Structured Implicit Functions
Authors:
Kyle Genova,
Forrester Cole,
Daniel Vlasic,
Aaron Sarna,
William T. Freeman,
Thomas Funkhouser
Abstract:
Template 3D shapes are useful for many tasks in graphics and vision, including fitting observation data, analyzing shape collections, and transferring shape attributes. Because of the variety of geometry and topology of real-world shapes, previous methods generally use a library of hand-made templates. In this paper, we investigate learning a general shape template from data. To allow for widely varying geometry and topology, we choose an implicit surface representation based on composition of local shape elements. While long known to computer graphics, this representation has not yet been explored in the context of machine learning for vision. We show that structured implicit functions are suitable for learning and allow a network to smoothly and simultaneously fit multiple classes of shapes. The learned shape template supports applications such as shape exploration, correspondence, abstraction, interpolation, and semantic segmentation from an RGB image.
Submitted 12 April, 2019;
originally announced April 2019.
-
Unsupervised Discovery of Parts, Structure, and Dynamics
Authors:
Zhenjia Xu,
Zhijian Liu,
Chen Sun,
Kevin Murphy,
William T. Freeman,
Joshua B. Tenenbaum,
Jiajun Wu
Abstract:
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future. In this paper, we propose a novel formulation that simultaneously learns a hierarchical, disentangled object representation and a dynamics model for object parts from unlabeled videos. Our Parts, Structure, and Dynamics (PSD) model learns to, first, recognize the object parts via a layered image representation; second, predict hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure; and third, model the system dynamics by predicting the future. Experiments on multiple real and synthetic datasets demonstrate that our PSD model works well on all three tasks: segmenting object parts, building their hierarchical structure, and capturing their motion distributions.
Submitted 12 March, 2019;
originally announced March 2019.
-
On the Units of GANs (Extended Abstract)
Authors:
David Bau,
Jun-Yan Zhu,
Hendrik Strobelt,
Bolei Zhou,
Joshua B. Tenenbaum,
William T. Freeman,
Antonio Torralba
Abstract:
Generative Adversarial Networks (GANs) have achieved impressive results for many real-world applications. As an active research topic, many GAN variants have emerged with improvements in sample quality and training stability. However, visualization and understanding of GANs is largely missing. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to concepts with a segmentation-based network dissection method. We quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. Finally, we examine the contextual relationship between these units and their surroundings by inserting the discovered concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in the scene. We will open source our interactive tools to help researchers and practitioners better understand their models.
Submitted 6 August, 2020; v1 submitted 29 January, 2019;
originally announced January 2019.
-
Learning to Infer and Execute 3D Shape Programs
Authors:
Yonglong Tian,
Andrew Luo,
Xingyuan Sun,
Kevin Ellis,
William T. Freeman,
Joshua B. Tenenbaum,
Jiajun Wu
Abstract:
Human perception of 3D shapes goes beyond reconstructing them as a set of points or a composition of geometric primitives: we also effortlessly understand higher-level shape structure such as the repetition and reflective symmetry of object parts. In contrast, recent advances in 3D shape sensing focus more on low-level geometry but less on these higher-level relationships. In this paper, we propose 3D shape programs, integrating bottom-up recognition systems with top-down, symbolic program structure to capture both low-level geometry and high-level structural priors for 3D shapes. Because there are no annotations of shape programs for real shapes, we develop neural modules that not only learn to infer 3D shape programs from raw, unannotated shapes, but also to execute these programs for shape reconstruction. After initial bootstrapping, our end-to-end differentiable model learns 3D shape programs by reconstructing shapes in a self-supervised manner. Experiments demonstrate that our model accurately infers and executes 3D shape programs for highly complex shapes from various categories. It can also be integrated with an image-to-shape module to infer 3D shape programs directly from an RGB image, leading to 3D shape reconstructions that are both more accurate and more physically plausible.
Submitted 9 August, 2019; v1 submitted 9 January, 2019;
originally announced January 2019.
-
Learning to Reconstruct Shapes from Unseen Classes
Authors:
Xiuming Zhang,
Zhoutong Zhang,
Chengkai Zhang,
Joshua B. Tenenbaum,
William T. Freeman,
Jiajun Wu
Abstract:
From a single image, humans are able to perceive the full 3D shape of an object by exploiting learned shape priors from everyday life. Contemporary single-image 3D reconstruction algorithms aim to solve this task in a similar fashion, but often end up with priors that are highly biased by training classes. Here we present an algorithm, Generalizable Reconstruction (GenRe), designed to capture more generic, class-agnostic shape priors. We achieve this with an inference network and training procedure that combine 2.5D representations of visible surfaces (depth and silhouette), spherical shape representations of both visible and non-visible surfaces, and 3D voxel-based representations, in a principled manner that exploits the causal structure of how 3D shapes give rise to 2D images. Experiments demonstrate that GenRe performs well on single-view shape reconstruction, and generalizes to diverse novel objects from categories not seen during training.
Submitted 28 December, 2018;
originally announced December 2018.
-
Reasoning About Physical Interactions with Object-Oriented Prediction and Planning
Authors:
Michael Janner,
Sergey Levine,
William T. Freeman,
Joshua B. Tenenbaum,
Chelsea Finn,
Jiajun Wu
Abstract:
Object-based factorizations provide a useful level of abstraction for interacting with the world. Building explicit object representations, however, often requires supervisory signals that are difficult to obtain in practice. We present a paradigm for learning object-centric representations for physical scene understanding without direct supervision of object properties. Our model, Object-Oriented Prediction and Planning (O2P2), jointly learns a perception function to map from image observations to object representations, a pairwise physics interaction function to predict the time evolution of a collection of objects, and a rendering function to map objects back to pixels. For evaluation, we consider not only the accuracy of the physical predictions of the model, but also its utility for downstream tasks that require an actionable representation of intuitive physics. After training our model on an image prediction task, we can use its learned representations to build block towers more complicated than those observed during training.
Submitted 7 January, 2019; v1 submitted 28 December, 2018;
originally announced December 2018.
-
Visual Object Networks: Image Generation with Disentangled 3D Representation
Authors:
Jun-Yan Zhu,
Zhoutong Zhang,
Chengkai Zhang,
Jiajun Wu,
Antonio Torralba,
Joshua B. Tenenbaum,
William T. Freeman
Abstract:
Recent progress in deep generative models has led to tremendous breakthroughs in image generation. However, while existing models can synthesize photorealistic images, they lack an understanding of our underlying 3D world. We present a new generative model, Visual Object Networks (VON), synthesizing natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, we unravel our image formation process into three conditionally independent factors---shape, viewpoint, and texture---and present an end-to-end adversarial learning framework that jointly models 3D shapes and 2D images. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then renders the object's 2.5D sketches (i.e., silhouette and depth map) from its shape under a sampled viewpoint. Finally, it learns to add realistic texture to these 2.5D sketches to generate natural images. The VON not only generates images that are more realistic than state-of-the-art 2D image synthesis methods, but also enables many 3D operations such as changing the viewpoint of a generated image, editing of shape and texture, linear interpolation in texture and shape space, and transferring appearance across different objects and viewpoints.
Submitted 6 December, 2018;
originally announced December 2018.
-
GAN Dissection: Visualizing and Understanding Generative Adversarial Networks
Authors:
David Bau,
Jun-Yan Zhu,
Hendrik Strobelt,
Bolei Zhou,
Joshua B. Tenenbaum,
William T. Freeman,
Antonio Torralba
Abstract:
Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models.
In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.
Submitted 8 December, 2018; v1 submitted 26 November, 2018;
originally announced November 2018.
-
Co-regularized Alignment for Unsupervised Domain Adaptation
Authors:
Abhishek Kumar,
Prasanna Sattigeri,
Kahini Wadhawan,
Leonid Karlinsky,
Rogerio Feris,
William T. Freeman,
Gregory Wornell
Abstract:
Deep neural networks, trained with a large amount of labeled data, can fail to generalize well when tested with examples from a target domain whose distribution differs from the training data distribution, referred to as the source domain. It can be expensive or even infeasible to obtain the required amount of labeled data in all possible domains. Unsupervised domain adaptation sets out to address this problem, aiming to learn a good predictive model for the target domain using labeled examples from the source domain but only unlabeled examples from the target domain. Domain alignment approaches this problem by matching the source and target feature distributions, and has been used as a key component in many state-of-the-art domain adaptation methods. However, matching the marginal feature distributions does not guarantee that the corresponding class conditional distributions will be aligned across the two domains. We propose co-regularized domain alignment for unsupervised domain adaptation, which constructs multiple diverse feature spaces and aligns source and target distributions in each of them individually, while encouraging that alignments agree with each other with regard to the class predictions on the unlabeled target examples. The proposed method is generic and can be used to improve any domain adaptation method which uses domain alignment. We instantiate it in the context of a recent state-of-the-art method and observe that it provides significant performance improvements on several domain adaptation benchmarks.
Submitted 13 November, 2018;
originally announced November 2018.
-
ChainQueen: A Real-Time Differentiable Physical Simulator for Soft Robotics
Authors:
Yuanming Hu,
Jiancheng Liu,
Andrew Spielberg,
Joshua B. Tenenbaum,
William T. Freeman,
Jiajun Wu,
Daniela Rus,
Wojciech Matusik
Abstract:
Physical simulators have been widely used in robot planning and control. Among them, differentiable simulators are particularly favored, as they can be incorporated into gradient-based optimization algorithms that are efficient in solving inverse problems such as optimal control and motion planning. Simulating deformable objects is, however, more challenging compared to rigid body dynamics. The underlying physical laws of deformable objects are more complex, and the resulting systems have orders of magnitude more degrees of freedom and therefore they are significantly more computationally expensive to simulate. Computing gradients with respect to physical design or controller parameters is typically even more computationally challenging. In this paper, we propose a real-time, differentiable hybrid Lagrangian-Eulerian physical simulator for deformable objects, ChainQueen, based on the Moving Least Squares Material Point Method (MLS-MPM). MLS-MPM can simulate deformable objects including contact and can be seamlessly incorporated into inference, control and co-design systems. We demonstrate that our simulator achieves high precision in both forward simulation and backward gradient computation. We have successfully employed it in a diverse set of control tasks for soft robots, including problems with nearly 3,000 decision variables.
Submitted 1 October, 2018;
originally announced October 2018.
-
MoSculp: Interactive Visualization of Shape and Time
Authors:
Xiuming Zhang,
Tali Dekel,
Tianfan Xue,
Andrew Owens,
Qiurui He,
Jiajun Wu,
Stefanie Mueller,
William T. Freeman
Abstract:
We present a system that allows users to visualize complex human motion via 3D motion sculptures---a representation that conveys the 3D structure swept by a human body as it moves through space. Given an input video, our system computes the motion sculpture and provides a user interface for rendering it in different styles, including the options to insert the sculpture back into the original video, render it in a synthetic scene or physically print it.
To provide this end-to-end workflow, we introduce an algorithm that estimates the human's 3D geometry over time from a set of 2D images and develop a 3D-aware image-based rendering approach that embeds the sculpture back into the scene. By automating the process, our system takes motion sculpture creation out of the realm of professional artists, and makes it applicable to a wide range of existing video material.
By providing viewers with 3D information, motion sculptures reveal space-time motion information that is difficult to perceive with the naked eye, and allow viewers to interpret how different parts of the object interact over time. We validate the effectiveness of this approach with user studies, finding that our motion sculpture visualizations are significantly more informative about motion than existing stroboscopic and space-time visualization methods.
Submitted 2 January, 2019; v1 submitted 14 September, 2018;
originally announced September 2018.
-
Physical Primitive Decomposition
Authors:
Zhijian Liu,
William T. Freeman,
Joshua B. Tenenbaum,
Jiajun Wu
Abstract:
Objects are made of parts, each with distinct geometry, physics, functionality, and affordances. Developing such a distributed, physical, interpretable representation of objects will facilitate intelligent agents to better explore and interact with the world. In this paper, we study physical primitive decomposition---understanding an object through its components, each with physical and geometric attributes. As annotated data for object parts and physics are rare, we propose a novel formulation that learns physical primitives by explaining both an object's appearance and its behaviors in physical events. Our model performs well on block towers and tools in both synthetic and real scenarios; we also demonstrate that visual and physical observations often provide complementary signals. We further present ablation and behavioral studies to better understand our model and contrast it with human performance.
Submitted 13 September, 2018;
originally announced September 2018.
-
Learning Shape Priors for Single-View 3D Completion and Reconstruction
Authors:
Jiajun Wu,
Chengkai Zhang,
Xiuming Zhang,
Zhoutong Zhang,
William T. Freeman,
Joshua B. Tenenbaum
Abstract:
The problem of single-view 3D shape completion or reconstruction is challenging, because among the many possible shapes that explain an observation, most are implausible and do not correspond to natural objects. Recent research in the field has tackled this problem by exploiting the expressiveness of deep convolutional networks. In fact, there is another level of ambiguity that is often overlooked: among plausible shapes, there are still multiple shapes that fit the 2D image equally well; i.e., the ground truth shape is non-deterministic given a single-view input. Existing fully supervised approaches fail to address this issue, and often produce blurry mean shapes with smooth surfaces but no fine details.
In this paper, we propose ShapeHD, pushing the limit of single-view shape completion and reconstruction by integrating deep generative models with adversarially learned shape priors. The learned priors serve as a regularizer, penalizing the model only if its output is unrealistic, not if it deviates from the ground truth. Our design thus overcomes both levels of ambiguity aforementioned. Experiments demonstrate that ShapeHD outperforms state of the art by a large margin in both shape completion and shape reconstruction on multiple real datasets.
Submitted 13 September, 2018;
originally announced September 2018.
-
Seeing Tree Structure from Vibration
Authors:
Tianfan Xue,
Jiajun Wu,
Zhoutong Zhang,
Chengkai Zhang,
Joshua B. Tenenbaum,
William T. Freeman
Abstract:
Humans recognize object structure from both their appearance and motion; often, motion helps to resolve ambiguities in object structure that arise when we observe object appearance only. There are particular scenarios, however, where neither appearance nor spatial-temporal motion signals are informative: occluding twigs may look connected and have almost identical movements, though they belong to different, possibly disconnected branches. We propose to tackle this problem through spectrum analysis of motion signals, because vibrations of disconnected branches, though visually similar, often have distinctive natural frequencies. We propose a novel formulation of tree structure based on a physics-based link model, and validate its effectiveness by theoretical analysis, numerical simulation, and empirical experiments. With this formulation, we use nonparametric Bayesian inference to reconstruct tree structure from both spectral vibration signals and appearance cues. Our model performs well in recognizing hierarchical tree structure from real-world videos of trees and vessels.
Submitted 13 September, 2018;
originally announced September 2018.
-
3D-Aware Scene Manipulation via Inverse Graphics
Authors:
Shunyu Yao,
Tzu Ming Harry Hsu,
Jun-Yan Zhu,
Jiajun Wu,
Antonio Torralba,
William T. Freeman,
Joshua B. Tenenbaum
Abstract:
We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous scene representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we propose 3D scene de-rendering networks (3D-SDN) to address the above issues by integrating disentangled representations for semantics, geometry, and appearance into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into a structured object-wise representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of semantics, geometry, and appearance supports 3D-aware scene manipulation, e.g., rotating and moving objects freely while keeping the consistent shape and texture, and changing the object appearance without affecting its shape. Experiments demonstrate that our editing scheme based on 3D-SDN is superior to its 2D counterpart.
Submitted 18 December, 2018; v1 submitted 28 August, 2018;
originally announced August 2018.
-
Medical Image Imputation from Image Collections
Authors:
Adrian V. Dalca,
Katherine L. Bouman,
William T. Freeman,
Natalia S. Rost,
Mert R. Sabuncu,
Polina Golland
Abstract:
We present an algorithm for creating high resolution anatomically plausible images consistent with acquired clinical brain MRI scans with large inter-slice spacing. Although large data sets of clinical images contain a wealth of information, time constraints during acquisition result in sparse scans that fail to capture much of the anatomy. These characteristics often render computational analysis impractical as many image analysis algorithms tend to fail when applied to such images. Highly specialized algorithms that explicitly handle sparse slice spacing do not generalize well across problem domains. In contrast, we aim to enable application of existing algorithms that were originally developed for high resolution research scans to significantly undersampled scans. We introduce a generative model that captures fine-scale anatomical structure across subjects in clinical image collections and derive an algorithm for filling in the missing data in scans with large inter-slice spacing. Our experimental results demonstrate that the resulting method outperforms state-of-the-art upsampling super-resolution techniques, and promises to facilitate subsequent analysis not previously possible with scans of this quality. Our implementation is freely available at https://github.com/adalca/papago .
Submitted 16 August, 2018;
originally announced August 2018.
-
3D Shape Perception from Monocular Vision, Touch, and Shape Priors
Authors:
Shaoxiong Wang,
Jiajun Wu,
Xingyuan Sun,
Wenzhen Yuan,
William T. Freeman,
Joshua B. Tenenbaum,
Edward H. Adelson
Abstract:
Perceiving accurate 3D object shape is important for robots to interact with the physical world. Current research along this direction has been primarily relying on visual observations. Vision, however useful, has inherent limitations due to occlusions and the 2D-3D ambiguities, especially for perception with a monocular camera. In contrast, touch gets precise local shape information, though its efficiency for reconstructing the entire shape could be low. In this paper, we propose a novel paradigm that efficiently perceives accurate 3D object shape by incorporating visual and tactile observations, as well as prior knowledge of common object shapes learned from large-scale shape repositories. We use vision first, applying neural networks with learned shape priors to predict an object's 3D shape from a single-view color image. We then use tactile sensing to refine the shape; the robot actively touches the object regions where the visual prediction has high uncertainty. Our method efficiently builds the 3D shape of common objects from a color image and a small number of tactile explorations (around 10). Our setup is easy to apply and has the potential to help robots better perform grasping or manipulation tasks on real-world objects.
Submitted 9 August, 2018;
originally announced August 2018.
-
Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks
Authors:
Tianfan Xue,
Jiajun Wu,
Katherine L. Bouman,
William T. Freeman
Abstract:
We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods that have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, and on real-world video frames. We present analyses of the learned network representations, showing it is implicitly learning a compact encoding of object appearance and motion. We also demonstrate a few of its applications, including visual analogy-making and video extrapolation.
Submitted 9 August, 2019; v1 submitted 24 July, 2018;
originally announced July 2018.
-
Unsupervised Training for 3D Morphable Model Regression
Authors:
Kyle Genova,
Forrester Cole,
Aaron Maschinot,
Aaron Sarna,
Daniel Vlasic,
William T. Freeman
Abstract:
We present a method for training a regression network from image pixels to 3D morphable model coordinates using only unlabeled photographs. The training loss is based on features from a facial recognition network, computed on-the-fly by rendering the predicted faces with a differentiable renderer. To make training from features feasible and avoid network fooling effects, we introduce three objectives: a batch distribution loss that encourages the output distribution to match the distribution of the morphable model, a loopback loss that ensures the network can correctly reinterpret its own output, and a multi-view identity loss that compares the features of the predicted 3D face and the input photograph from multiple viewing angles. We train a regression network using these objectives, a set of unlabeled photographs, and the morphable model itself, and demonstrate state-of-the-art results.
Submitted 15 June, 2018;
originally announced June 2018.
-
Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling
Authors:
Xingyuan Sun,
Jiajun Wu,
Xiuming Zhang,
Zhoutong Zhang,
Chengkai Zhang,
Tianfan Xue,
Joshua B. Tenenbaum,
William T. Freeman
Abstract:
We study 3D shape modeling from a single image and make contributions to it in three aspects. First, we present Pix3D, a large-scale benchmark of diverse image-shape pairs with pixel-level 2D-3D alignment. Pix3D has wide applications in shape-related tasks including reconstruction, retrieval, viewpoint estimation, etc. Building such a large-scale dataset, however, is highly challenging; existing datasets either contain only synthetic data, or lack precise alignment between 2D images and 3D shapes, or only have a small number of images. Second, we calibrate the evaluation criteria for 3D shape reconstruction through behavioral studies, and use them to objectively and systematically benchmark cutting-edge reconstruction algorithms on Pix3D. Third, we design a novel model that simultaneously performs 3D reconstruction and pose estimation; our multi-task learning approach achieves state-of-the-art performance on both tasks.
Submitted 12 April, 2018;
originally announced April 2018.
-
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Authors:
Ariel Ephrat,
Inbar Mosseri,
Oran Lang,
Tali Dekel,
Kevin Wilson,
Avinatan Hassidim,
William T. Freeman,
Michael Rubinstein
Abstract:
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Submitted 9 August, 2018; v1 submitted 10 April, 2018;
originally announced April 2018.
-
Learning-based Video Motion Magnification
Authors:
Tae-Hyun Oh,
Ronnachai Jaroensri,
Changil Kim,
Mohamed Elgharib,
Frédo Durand,
William T. Freeman,
Wojciech Matusik
Abstract:
Video motion magnification techniques allow us to see small motions previously invisible to the naked eye, such as those of vibrating airplane wings, or swaying buildings under the influence of the wind. Because the motion is small, the magnification results are prone to noise or excessive blurring. The state of the art relies on hand-designed filters to extract representations that may not be optimal. In this paper, we seek to learn the filters directly from examples using deep convolutional neural networks. To make training tractable, we carefully design a synthetic dataset that captures small motion well, and use two-frame input for training. We show that the learned filters achieve high-quality results on real videos, with fewer ringing artifacts and better noise characteristics than previous methods. While our model is not trained with temporal filters, we found that the temporal filters can be used with our extracted representations up to a moderate magnification, enabling a frequency-based motion selection. Finally, we analyze the learned filters and show that they behave similarly to the derivative filters used in previous works. Our code, trained model, and datasets will be available online.
Submitted 31 July, 2018; v1 submitted 8 April, 2018;
originally announced April 2018.
-
3D Interpreter Networks for Viewer-Centered Wireframe Modeling
Authors:
Jiajun Wu,
Tianfan Xue,
Joseph J. Lim,
Yuandong Tian,
Joshua B. Tenenbaum,
Antonio Torralba,
William T. Freeman
Abstract:
Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations to real images. Previous research tackled this problem by either searching for a 3D shape that best explains 2D annotations, or training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation to connect real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects, without suffering from the domain difference between real and synthesized images, often due to imperfect rendering. Second, we propose a Projection Layer, mapping estimated 3D structure back to 2D. During training, it ensures 3D-INN to predict 3D structure whose projection is consistent with the 2D annotations to real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide vision applications, such as image retrieval.
Submitted 9 August, 2019; v1 submitted 2 April, 2018;
originally announced April 2018.
-
Smart, Sparse Contours to Represent and Edit Images
Authors:
Tali Dekel,
Chuang Gan,
Dilip Krishnan,
Ce Liu,
William T. Freeman
Abstract:
We study the problem of reconstructing an image from information stored at contour locations. We show that high-quality reconstructions with high fidelity to the source image can be obtained from sparse input, e.g., comprising less than $6\%$ of image pixels. This is a significant improvement over existing contour-based reconstruction methods that require much denser input to capture subtle texture information and to ensure image quality. Our model, based on generative adversarial networks, synthesizes texture and details in regions where no input information is provided. The semantic knowledge encoded into our model and the sparsity of the input allow contours to be used as an intuitive interface for semantically-aware image manipulation: local edits in contour domain translate to long-range and coherent changes in pixel space. We can perform complex structural changes such as changing facial expression by simple edits of contours. Our experiments demonstrate that humans as well as a face recognition system mostly cannot distinguish between our reconstructions and the source images.
Submitted 9 April, 2018; v1 submitted 21 December, 2017;
originally announced December 2017.
-
Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
Authors:
Andrew Owens,
Jiajun Wu,
Josh H. McDermott,
William T. Freeman,
Antonio Torralba
Abstract:
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. 2016, with additional experiments and discussion.
Submitted 19 December, 2017;
originally announced December 2017.
-
Video Enhancement with Task-Oriented Flow
Authors:
Tianfan Xue,
Baian Chen,
Jiajun Wu,
Donglai Wei,
William T. Freeman
Abstract:
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
Submitted 10 November, 2019; v1 submitted 24 November, 2017;
originally announced November 2017.
-
Exploiting Occlusion in Non-Line-of-Sight Active Imaging
Authors:
Christos Thrampoulidis,
Gal Shulkind,
Feihu Xu,
William T. Freeman,
Jeffrey H. Shapiro,
Antonio Torralba,
Franco N. C. Wong,
Gregory W. Wornell
Abstract:
Active non-line-of-sight imaging systems are of growing interest for diverse applications. The most commonly proposed approaches to date rely on exploiting time-resolved measurements, i.e., measuring the time it takes for short light pulses to transit the scene. This typically requires expensive, specialized, ultrafast lasers and detectors that must be carefully calibrated. We develop an alternative approach that exploits the valuable role that natural occluders in a scene play in enabling accurate and practical image formation in such settings without such hardware complexity. In particular, we demonstrate that the presence of occluders in the hidden scene can obviate the need for collecting time-resolved measurements, and develop an accompanying analysis for such systems and their generalizations. Ultimately, the results suggest the potential to develop increasingly sophisticated future systems that are able to identify and exploit diverse structural features of the environment to reconstruct scenes hidden from view.
Submitted 16 November, 2017;
originally announced November 2017.
-
MarrNet: 3D Shape Reconstruction via 2.5D Sketches
Authors:
Jiajun Wu,
Yifan Wang,
Tianfan Xue,
Xingyuan Sun,
William T Freeman,
Joshua B Tenenbaum
Abstract:
3D object reconstruction from a single image is a highly under-determined problem, requiring strong prior knowledge of plausible 3D shapes. This introduces challenges for learning-based approaches, as 3D object annotations are scarce in real images. Previous work chose to train on synthetic data with ground truth 3D information, but suffered from domain adaptation when tested on real data. In this work, we propose MarrNet, an end-to-end trainable model that sequentially estimates 2.5D sketches and 3D object shape. Our disentangled, two-step formulation has three advantages. First, compared to full 3D shape, 2.5D sketches are much easier to be recovered from a 2D image; models that recover 2.5D sketches are also more likely to transfer from synthetic to real data. Second, for 3D reconstruction from 2.5D sketches, systems can learn purely from synthetic data. This is because we can easily render realistic 2.5D sketches without modeling object appearance variations in real images, including lighting, texture, etc. This further relieves the domain adaptation problem. Third, we derive differentiable projective functions from 3D shape to 2.5D sketches; the framework is therefore end-to-end trainable on real images, requiring no human annotations. Our model achieves state-of-the-art performance on 3D shape reconstruction.
Submitted 8 November, 2017;
originally announced November 2017.
-
Reconstructing Video from Interferometric Measurements of Time-Varying Sources
Authors:
Katherine L. Bouman,
Michael D. Johnson,
Adrian V. Dalca,
Andrew A. Chael,
Freek Roelofs,
Sheperd S. Doeleman,
William T. Freeman
Abstract:
Very long baseline interferometry (VLBI) makes it possible to recover images of astronomical sources with extremely high angular resolution. Most recently, the Event Horizon Telescope (EHT) has extended VLBI to short millimeter wavelengths with a goal of achieving angular resolution sufficient for imaging the event horizons of nearby supermassive black holes. VLBI provides measurements related to the underlying source image through a sparse set of spatial frequencies. An image can then be recovered from these measurements by making assumptions about the underlying image. One of the most important assumptions made by conventional imaging methods is that over the course of a night's observation the image is static. However, for quickly evolving sources, such as the galactic center's supermassive black hole (Sgr A*) targeted by the EHT, this assumption is violated and these conventional imaging approaches fail. In this work we propose a new way to model VLBI measurements that allows us to recover both the appearance and dynamics of an evolving source by reconstructing a video rather than a static image. By modeling VLBI measurements using a Gaussian Markov Model, we are able to propagate information across observations in time to reconstruct a video, while simultaneously learning about the dynamics of the source's emission region. We demonstrate our proposed Expectation-Maximization (EM) algorithm, StarWarps, on realistic synthetic observations of black holes, and show how it substantially improves results compared to conventional imaging algorithms. Additionally, we demonstrate StarWarps on real VLBI data of the M87 Jet from the VLBA.
Submitted 1 February, 2018; v1 submitted 3 November, 2017;
originally announced November 2017.
-
Synthesizing Normalized Faces from Facial Identity Features
Authors:
Forrester Cole,
David Belanger,
Dilip Krishnan,
Aaron Sarna,
Inbar Mosseri,
William T. Freeman
Abstract:
We present a method for synthesizing a frontal, neutral-expression image of a person's face given an input face photograph. This is achieved by learning to generate facial landmarks and textures from features extracted from a facial-recognition network. Unlike previous approaches, our encoding feature vector is largely invariant to lighting, pose, and facial expression. Exploiting this invariance, we train our decoder network using only frontal, neutral-expression photographs. Since these photographs are well aligned, we can decompose them into a sparse set of landmark points and aligned texture maps. The decoder then predicts landmarks and textures independently and combines them using a differentiable image warping operation. The resulting images can be used for a number of applications, such as analyzing facial attributes, exposure and white balance adjustment, or creating a 3-D avatar.
Submitted 17 October, 2017; v1 submitted 17 January, 2017;
originally announced January 2017.
-
Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling
Authors:
Jiajun Wu,
Chengkai Zhang,
Tianfan Xue,
William T. Freeman,
Joshua B. Tenenbaum
Abstract:
We study the problem of 3D object generation. We propose a novel framework, namely 3D Generative Adversarial Network (3D-GAN), which generates 3D objects from a probabilistic space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. The benefits of our model are three-fold: first, the use of an adversarial criterion, instead of traditional heuristic criteria, enables the generator to capture object structure implicitly and to synthesize high-quality 3D objects; second, the generator establishes a mapping from a low-dimensional probabilistic space to the space of 3D objects, so that we can sample objects without a reference image or CAD models, and explore the 3D object manifold; third, the adversarial discriminator provides a powerful 3D shape descriptor which, learned without supervision, has wide applications in 3D object recognition. Experiments demonstrate that our method generates high-quality 3D objects, and our unsupervisedly learned features achieve impressive performance on 3D object recognition, comparable with those of supervised learning methods.
Submitted 4 January, 2017; v1 submitted 24 October, 2016;
originally announced October 2016.
-
Best-Buddies Similarity - Robust Template Matching using Mutual Nearest Neighbors
Authors:
Shaul Oron,
Tali Dekel,
Tianfan Xue,
William T. Freeman,
Shai Avidan
Abstract:
We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs)--pairs of points in source and target sets, where each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.
Submitted 6 September, 2016;
originally announced September 2016.
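The Best-Buddies Pair definition in the abstract above (two points, one from each set, that are each other's nearest neighbor) can be illustrated with a minimal sketch. The function names, the Euclidean distance, and the normalization by the smaller set size are assumptions made for this illustration, not details taken from the listing.

```python
from math import dist  # Euclidean distance, Python 3.8+


def best_buddies_pairs(source, target):
    """Return index pairs (i, j) where source[i] and target[j] are
    mutual nearest neighbors. Hypothetical helper for illustration."""
    if not source or not target:
        return []
    # Index of the nearest target point for every source point.
    nn_in_target = [
        min(range(len(target)), key=lambda j: dist(p, target[j])) for p in source
    ]
    # Index of the nearest source point for every target point.
    nn_in_source = [
        min(range(len(source)), key=lambda i: dist(q, source[i])) for q in target
    ]
    # Keep only pairs where the nearest-neighbor relation holds both ways.
    return [(i, j) for i, j in enumerate(nn_in_target) if nn_in_source[j] == i]


def bbs(source, target):
    """Count of Best-Buddies Pairs, normalized by the smaller set size
    (the normalization is an assumption of this sketch)."""
    pairs = best_buddies_pairs(source, target)
    return len(pairs) / min(len(source), len(target))


# Two well-matched points plus one outlier in each set.
source_points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
target_points = [(0.0, 0.1), (1.0, 0.1), (50.0, 50.0)]
pairs = best_buddies_pairs(source_points, target_points)
score = bbs(source_points, target_points)
```

In this toy input the two outliers are not mutual nearest neighbors, so they form no pair and do not contribute to the score, which is the kind of outlier behavior the abstract describes.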
-
Ambient Sound Provides Supervision for Visual Learning
Authors:
Andrew Owens,
Jiajun Wu,
Josh H. McDermott,
William T. Freeman,
Antonio Torralba
Abstract:
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
Submitted 5 December, 2016; v1 submitted 25 August, 2016;
originally announced August 2016.
-
Observing---and Imaging---Active Galactic Nuclei with the Event Horizon Telescope
Authors:
Vincent L. Fish,
Kazunori Akiyama,
Katherine L. Bouman,
Andrew A. Chael,
Michael D. Johnson,
Sheperd S. Doeleman,
Lindy Blackburn,
John F. C. Wardle,
William T. Freeman,
the Event Horizon Telescope Collaboration
Abstract:
Originally developed to image the shadow region of the central black hole in Sagittarius A* and in the nearby galaxy M87, the Event Horizon Telescope (EHT) provides deep, very high angular resolution data on other AGN sources too. The challenges of working with EHT data have spurred the development of new image reconstruction algorithms. This work briefly reviews the status of the EHT and its utility for observing AGN sources, with emphasis on novel imaging techniques that offer the promise of better reconstructions at 1.3 mm and other wavelengths.
Submitted 11 July, 2016;
originally announced July 2016.
-
Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
Authors:
Tianfan Xue,
Jiajun Wu,
Katherine L. Bouman,
William T. Freeman
Abstract:
We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach that models future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. Future frame synthesis is challenging, as it involves low- and high-level image and motion understanding. We propose a novel network structure, namely a Cross Convolutional Network, to aid in synthesizing future frames; this network structure encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world videos. We also show that our model can be applied to tasks such as visual analogy-making, and present an analysis of the learned network representations.
Submitted 9 July, 2016;
originally announced July 2016.
-
A Comparative Evaluation of Approximate Probabilistic Simulation and Deep Neural Networks as Accounts of Human Physical Scene Understanding
Authors:
Renqiao Zhang,
Jiajun Wu,
Chengkai Zhang,
William T. Freeman,
Joshua B. Tenenbaum
Abstract:
Humans demonstrate remarkable abilities to predict physical events in complex scenes. Two classes of models for physical scene understanding have recently been proposed: "Intuitive Physics Engines", or IPEs, which posit that people make predictions by running approximate probabilistic simulations in causal mental models similar in nature to video-game physics engines, and memory-based models, which make judgments based on analogies to stored experiences of previously encountered scenes and physical outcomes. Versions of the latter have recently been instantiated in convolutional neural network (CNN) architectures. Here we report four experiments that, to our knowledge, are the first rigorous comparisons of simulation-based and CNN-based models, where both approaches are concretely instantiated in algorithms that can run on raw image inputs and produce as outputs physical judgments such as whether a stack of blocks will fall. Both approaches can achieve super-human accuracy levels and can quantitatively predict human judgments to a similar degree, but only the simulation-based models generalize to novel situations in ways that people do, and are qualitatively consistent with systematic perceptual illusions and judgment asymmetries that people show.
Submitted 3 October, 2016; v1 submitted 4 May, 2016;
originally announced May 2016.
-
Single Image 3D Interpreter Network
Authors:
Jiajun Wu,
Tianfan Xue,
Joseph J. Lim,
Yuandong Tian,
Joshua B. Tenenbaum,
Antonio Torralba,
William T. Freeman
Abstract:
Understanding 3D object structure from a single image is an important but difficult task in computer vision, mostly due to the lack of 3D object annotations in real images. Previous work tackles this problem by either solving an optimization task given 2D keypoint positions, or training on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data. This is made possible mainly by two technical innovations. First, we propose a Projection Layer, which projects estimated 3D structure to 2D space, so that 3D-INN can be trained to predict 3D structural parameters supervised by 2D annotations on real images. Second, heatmaps of keypoints serve as an intermediate representation connecting real and synthetic data, enabling 3D-INN to benefit from the variation and abundance of synthetic 3D objects, without suffering from the difference between the statistics of real and synthesized images due to imperfect rendering. The network achieves state-of-the-art performance on both 2D keypoint estimation and 3D structure recovery. We also show that the recovered 3D information can be used in other vision applications, such as 3D rendering and image retrieval.
Submitted 4 October, 2016; v1 submitted 29 April, 2016;
originally announced April 2016.
-
Visually Indicated Sounds
Authors:
Andrew Owens,
Phillip Isola,
Josh McDermott,
Antonio Torralba,
Edward H. Adelson,
William T. Freeman
Abstract:
Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.
Submitted 29 April, 2016; v1 submitted 28 December, 2015;
originally announced December 2015.
-
Computational Imaging for VLBI Image Reconstruction
Authors:
Katherine L. Bouman,
Michael D. Johnson,
Daniel Zoran,
Vincent L. Fish,
Sheperd S. Doeleman,
William T. Freeman
Abstract:
Very long baseline interferometry (VLBI) is a technique for imaging celestial radio emissions by simultaneously observing a source from telescopes distributed across Earth. The challenges in reconstructing images from fine angular resolution VLBI data are immense. The data is extremely sparse and noisy, thus requiring statistical image models such as those designed in the computer vision community. In this paper we present a novel Bayesian approach for VLBI image reconstruction. While other methods often require careful tuning and parameter selection for different types of data, our method (CHIRP) produces good results under different settings such as low SNR or extended emission. The success of our method is demonstrated on realistic synthetic experiments as well as publicly available real data. We present this problem in a way that is accessible to members of the community, and provide a dataset website (vlbiimaging.csail.mit.edu) that facilitates controlled comparisons across algorithms.
Submitted 7 November, 2016; v1 submitted 4 December, 2015;
originally announced December 2015.
-
Imaging an Event Horizon: Mitigation of Scattering Toward Sagittarius A*
Authors:
Vincent L. Fish,
Michael D. Johnson,
Ru-Sen Lu,
Sheperd S. Doeleman,
Katherine L. Bouman,
Daniel Zoran,
William T. Freeman,
Dimitrios Psaltis,
Ramesh Narayan,
Victor Pankratius,
Avery E. Broderick,
Carl R. Gwinn,
Laura E. Vertatschitsch
Abstract:
The image of the emission surrounding the black hole in the center of the Milky Way is predicted to exhibit the imprint of general relativistic (GR) effects, including the existence of a shadow feature and a photon ring of diameter ~50 microarcseconds. Structure on these scales can be resolved by millimeter-wavelength very long baseline interferometry (VLBI). However, strong-field GR features of interest will be blurred at lambda >= 1.3 mm due to scattering by interstellar electrons. The scattering properties are well understood over most of the relevant range of baseline lengths, suggesting that the scattering may be (mostly) invertible. We simulate observations of a model image of Sgr A* and demonstrate that the effects of scattering can indeed be mitigated by correcting the visibilities before reconstructing the image. This technique is also applicable to Sgr A* at longer wavelengths.
Submitted 16 September, 2014;
originally announced September 2014.
-
Network analysis reveals distinct clinical syndromes underlying acute mountain sickness
Authors:
David P Hall,
Ian JC MacCormick,
Alex T Phythian-Adams,
Nina M Rzechorzek,
David Hope-Jones,
Sorrel Cosens,
Stewart Jackson,
Matthew GD Bates,
David J Collier,
David A Hume,
Thomas Freeman,
AA Roger Thompson,
J Kenneth Baillie
Abstract:
Acute mountain sickness (AMS) is a common problem among visitors at high altitude, and may progress to life-threatening pulmonary and cerebral oedema in a minority of cases. International consensus defines AMS as a constellation of subjective, non-specific symptoms. Specifically, headache, sleep disturbance, fatigue and dizziness are given equal diagnostic weighting. Different pathophysiological mechanisms are now thought to underlie headache and sleep disturbance during acute exposure to high altitude. Hence, these symptoms may not belong together as a single syndrome. Using a novel visual analogue scale (VAS), we sought to undertake a systematic exploration of the symptomatology of AMS using an unbiased, data-driven approach originally designed for analysis of gene expression. Symptom scores were collected from 293 subjects during 1110 subject-days at altitudes between 3650m and 5200m on Apex expeditions to Bolivia and Kilimanjaro. Three distinct patterns of symptoms were consistently identified. Although fatigue is a ubiquitous finding, sleep disturbance and headache are each commonly reported without the other. The commonest pattern of symptoms was sleep disturbance and fatigue, with little or no headache. In subjects reporting severe headache, 40% did not report sleep disturbance. Sleep disturbance correlates poorly with other symptoms of AMS (Pearson r = 0.31 vs headache). These results challenge the accepted paradigm that AMS is a single disease process and describe at least two distinct syndromes following acute ascent to high altitude. This approach to analysing symptom patterns has potential utility in other clinical syndromes.
Submitted 26 March, 2013;
originally announced March 2013.
-
Exploiting compositionality to explore a large space of model structures
Authors:
Roger Grosse,
Ruslan R Salakhutdinov,
William T. Freeman,
Joshua B. Tenenbaum
Abstract:
The recent proliferation of richly structured probabilistic models raises the question of how to automatically determine an appropriate model for a dataset. We investigate this question for a space of matrix decomposition models which can express a variety of widely used models from unsupervised learning. To enable model selection, we organize these models into a context-free grammar which generates a wide variety of structures through the compositional application of a few simple rules. We use our grammar to generically and efficiently infer latent components and estimate predictive likelihood for nearly 2500 structures using a small toolbox of reusable algorithms. Using a greedy search over our grammar, we automatically choose the decomposition structure from raw data by evaluating only a small fraction of all models. The proposed method typically finds the correct structure for synthetic data and backs off gracefully to simpler models under heavy noise. It learns sensible structures for datasets as diverse as image patches, motion capture, 20 Questions, and U.S. Senate votes, all using exactly the same code.
Submitted 16 October, 2012;
originally announced October 2012.
-
Informative Sensing
Authors:
Hyun Sung Chang,
Yair Weiss,
William T. Freeman
Abstract:
Compressed sensing is a recent set of mathematical results showing that sparse signals can be exactly reconstructed from a small number of linear measurements. Interestingly, for ideal sparse signals with no measurement noise, random measurements allow perfect reconstruction while measurements based on principal component analysis (PCA) or independent component analysis (ICA) do not. At the same time, for other signal and noise distributions, PCA and ICA can significantly outperform random projections in terms of enabling reconstruction from a small number of measurements. In this paper we ask: given the distribution of signals we wish to measure, what is the optimal set of linear projections for compressed sensing? We consider the problem of finding a small number of linear projections that are maximally informative about the signal. Formally, we use the InfoMax criterion and seek to maximize the mutual information between the signal, x, and the (possibly noisy) projection y=Wx. We show that in general the optimal projections are neither the principal components of the data nor random projections, but rather a seemingly novel set of projections that capture what is still uncertain about the signal, given knowledge of its distribution. We present analytic solutions for certain special cases including natural images. In particular, for natural images, the near-optimal projections are bandwise random, i.e., incoherent to the sparse bases at a particular frequency band but with more weight on the low frequencies, which has a physical relation to the multi-resolution representation of images.
Submitted 27 January, 2009;
originally announced January 2009.
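As a rough illustration of the InfoMax criterion mentioned in the abstract above, the sketch below evaluates the mutual information between a signal x and a single noisy projection y = w.x + n in the one case where it has a simple closed form: a zero-mean Gaussian signal with additive Gaussian noise. That Gaussian setting, the function name, and the example covariance are assumptions chosen for illustration; the sketch does not reproduce the paper's analysis for sparse signals or natural images.

```python
import math


def gaussian_projection_information(w, cov, noise_var):
    """Mutual information in nats between x ~ N(0, cov) and y = w.x + n,
    with n ~ N(0, noise_var). Hypothetical helper for illustration."""
    n = len(w)
    # Variance of the noiseless projection w.x, i.e. w^T cov w.
    signal_var = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    # For jointly Gaussian variables, I(x; y) = 0.5 * ln(1 + signal/noise).
    return 0.5 * math.log(1.0 + signal_var / noise_var)


# Example covariance with variance 4 along the first axis and 1 along the second.
cov = [[4.0, 0.0], [0.0, 1.0]]
info_major = gaussian_projection_information([1.0, 0.0], cov, noise_var=1.0)
info_minor = gaussian_projection_information([0.0, 1.0], cov, noise_var=1.0)
```

In this Gaussian special case, a unit-norm projection along the high-variance axis carries more information than one along the low-variance axis. The abstract's claim concerns general distributions, where the optimal projections need not coincide with principal components.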