-
SlowFast Networks for Video Recognition
Authors:
Christoph Feichtenhofer,
Haoqi Fan,
Jitendra Malik,
Kaiming He
Abstract:
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast
Submitted 29 October, 2019; v1 submitted 10 December, 2018;
originally announced December 2018.
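The two-rate idea described in the SlowFast abstract above can be illustrated with a minimal frame-sampling sketch. This is not the authors' implementation: the function name `sample_pathway_indices` and the parameters `slow_stride` and `speed_ratio`, along with their default values, are assumptions chosen for illustration. It only shows how a Slow pathway might see sparse frames while a Fast pathway sees the same clip at a higher frame rate.

```python
def sample_pathway_indices(num_frames, slow_stride=16, speed_ratio=8):
    """Return (slow, fast) frame indices for a clip of num_frames frames.

    slow_stride and speed_ratio are hypothetical values; the abstract
    does not specify them.
    """
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be a multiple of speed_ratio")
    # The Fast pathway samples speed_ratio times more densely than the Slow one.
    fast_stride = slow_stride // speed_ratio
    slow = list(range(0, num_frames, slow_stride))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


slow, fast = sample_pathway_indices(64)
print(len(slow), len(fast))
```

With these assumed settings, every frame seen by the Slow pathway is also seen by the Fast pathway, which samples eight times as many frames from the same clip.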
-
Fetal whole-heart 4D imaging using motion-corrected multi-planar real-time MRI
Authors:
Joshua FP van Amerom,
David FA Lloyd,
Maria Deprez,
Anthony N Price,
Shaihan J Malik,
Kuberan Pushparajah,
Milou PM van Poppel,
Mary A Rutherford,
Reza Razavi,
Joseph V Hajnal
Abstract:
Purpose: To develop an MRI acquisition and reconstruction framework for volumetric cine visualisation of the fetal heart and great vessels in the presence of maternal and fetal motion.
Methods: Four-dimensional depiction was achieved using a highly-accelerated multi-planar real-time balanced steady state free precession acquisition combined with retrospective image-domain techniques for motion correction, cardiac synchronisation and outlier rejection. The framework was evaluated and optimised using a numerical phantom, and evaluated in a study of 20 mid- to late-gestational age human fetal subjects. Reconstructed cine volumes were evaluated by experienced cardiologists and compared with matched ultrasound. A preliminary assessment of flow-sensitive reconstruction using the velocity information encoded in the phase of dynamic images is included.
Results: Reconstructed cine volumes could be visualised in any 2D plane without the need for highly-specific scan plane prescription prior to acquisition or for maternal breath hold to minimise motion. Reconstruction was fully automated aside from user-specified masks of the fetal heart and chest. The framework proved robust when applied to fetal data and simulations confirmed that spatial and temporal features could be reliably recovered. Expert evaluation suggested the reconstructed volumes can be used for comprehensive assessment of the fetal heart, either as an adjunct to ultrasound or in combination with other MRI techniques.
Conclusion: The proposed methods show promise as a framework for motion-compensated 4D assessment of the fetal heart and great vessels.
Submitted 5 December, 2018;
originally announced December 2018.
-
Learning 3D Human Dynamics from Video
Authors:
Angjoo Kanazawa,
Jason Y. Zhang,
Panna Felsen,
Jitendra Malik
Abstract:
From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation gives rise to smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. Though annotated data is always limited, there are millions of videos uploaded daily on the Internet. In this work, we harvest this Internet-scale source of unlabeled data by training our model on unlabeled video with pseudo-ground truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground truth 2D pose monotonically improves 3D prediction performance. We evaluate our model, Human Mesh and Motion Recovery (HMMR), on the recent challenging dataset of 3D Poses in the Wild and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning. The project website with video, code, and data can be found at https://akanazawa.github.io/human_dynamics/.
Submitted 16 September, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Visual Memory for Robust Path Following
Authors:
Ashish Kumar,
Saurabh Gupta,
David Fouhey,
Sergey Levine,
Jitendra Malik
Abstract:
Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches and other learning based baselines.
Submitted 3 December, 2018;
originally announced December 2018.
-
Are All Training Examples Created Equal? An Empirical Study
Authors:
Kailas Vodrahalli,
Ke Li,
Jitendra Malik
Abstract:
Modern computer vision algorithms often rely on very large training datasets. However, it is conceivable that a carefully selected subsample of the dataset is sufficient for training. In this paper, we propose a gradient-based importance measure that we use to empirically analyze relative importance of training images in four datasets of varying complexity. We find that in some cases, a small subsample is indeed sufficient for training. For other datasets, however, the relative differences in importance are negligible. These results have important implications for active learning on deep networks. Additionally, our analysis method can be used as a general tool to better understand diversity of training examples in datasets.
Submitted 29 November, 2018;
originally announced November 2018.
-
On the Implicit Assumptions of GANs
Authors:
Ke Li,
Jitendra Malik
Abstract:
Generative adversarial nets (GANs) have generated a lot of excitement. Despite their popularity, they exhibit a number of well-documented issues in practice, which apparently contradict theoretical guarantees. A number of enlightening papers have pointed out that these issues arise from unjustified assumptions that are commonly made, but the message seems to have been lost amid the optimism of recent years. We believe the identified problems deserve more attention, and highlight the implications on both the properties of GANs and the trajectory of research on probabilistic models. We recently proposed an alternative method that sidesteps these problems.
Submitted 29 November, 2018;
originally announced November 2018.
-
Diverse Image Synthesis from Semantic Layouts via Conditional IMLE
Authors:
Ke Li,
Tianhao Zhang,
Jitendra Malik
Abstract:
Most existing methods for conditional image synthesis are only able to generate a single plausible image for any given input, or at best a fixed number of plausible images. In this paper, we focus on the problem of generating images from semantic segmentation maps and present a simple new method that can generate an arbitrary number of images with diverse appearance for the same semantic layout. Unlike most existing approaches which adopt the GAN framework, our method is based on the recently introduced Implicit Maximum Likelihood Estimation (IMLE) framework. Compared to the leading approach, our method is able to generate more diverse images while producing fewer artifacts despite using the same architecture. The learned latent space also has sensible structure despite the lack of supervision that encourages such behaviour. Videos and code are available at https://people.eecs.berkeley.edu/~ke.li/projects/imle/scene_layouts/.
Submitted 29 August, 2019; v1 submitted 29 November, 2018;
originally announced November 2018.
-
Recycling cardiogenic artifacts in impedance pneumography
Authors:
Yao Lu,
Hau-tieng Wu,
John Malik
Abstract:
Purpose: Biomedical sensors often exhibit cardiogenic artifacts which, while distorting the signal of interest, carry useful hemodynamic information. We propose an algorithm to remove and extract hemodynamic information from these cardiogenic artifacts. Methods: We apply a nonlinear time-frequency analysis technique, the de-shape synchrosqueezing transform (dsSST), to adaptively isolate the high- and low-frequency components of a single-channel signal. We demonstrate this technique's effectiveness by removing and deriving hemodynamic information from the cardiogenic artifact in an impedance pneumography (IP). Results: The instantaneous heart rate is extracted, and the cardiac and respiratory signals are reconstructed. Conclusions: The dsSST is suitable for generating useful hemodynamic information from the cardiogenic artifact in a single-channel IP. We propose that the usefulness of the dsSST as a recycling tool extends to other biomedical sensors exhibiting cardiogenic artifacts.
Submitted 26 February, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
SFV: Reinforcement Learning of Physical Skills from Videos
Authors:
Xue Bin Peng,
Angjoo Kanazawa,
Jitendra Malik,
Pieter Abbeel,
Sergey Levine
Abstract:
Data-driven character animation based on motion capture can produce highly naturalistic behaviors and, when combined with physics simulation, can provide for natural procedural responses to physical perturbations, environmental changes, and morphological discrepancies. Motion capture remains the most popular source of motion data, but collecting mocap data typically requires heavily instrumented environments and actors. In this paper, we propose a method that enables physically simulated characters to learn skills from videos (SFV). Our approach, based on deep pose estimation and deep reinforcement learning, allows data-driven animation to leverage the abundance of publicly available video clips from the web, such as those from YouTube. This has the potential to enable fast and easy design of character controllers simply by querying for video recordings of the desired behavior. The resulting controllers are robust to perturbations, can be adapted to new settings, can perform basic object interactions, and can be retargeted to new morphologies via reinforcement learning. We further demonstrate that our method can predict potential human motions from still images, by forward simulation of learned controllers initialized from the observed pose. Our framework is able to learn a broad range of dynamic skills, including locomotion, acrobatics, and martial arts.
Submitted 15 October, 2018; v1 submitted 8 October, 2018;
originally announced October 2018.
-
Super-Resolution via Conditional Implicit Maximum Likelihood Estimation
Authors:
Ke Li,
Shichong Peng,
Jitendra Malik
Abstract:
Single-image super-resolution (SISR) is a canonical problem with diverse applications. Leading methods like SRGAN produce images that contain various artifacts, such as high-frequency noise, hallucinated colours and shape distortions, which adversely affect the realism of the result. In this paper, we propose an alternative approach based on an extension of the method of Implicit Maximum Likelihood Estimation (IMLE). We demonstrate greater effectiveness at noise reduction and preservation of the original colours and shapes, yielding more realistic super-resolved images.
Submitted 2 October, 2018;
originally announced October 2018.
-
Implicit Maximum Likelihood Estimation
Authors:
Ke Li,
Jitendra Malik
Abstract:
Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.
Submitted 22 October, 2018; v1 submitted 24 September, 2018;
originally announced September 2018.
-
Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection
Authors:
Weicheng Kuo,
Christian Häne,
Esther Yuh,
Pratik Mukherjee,
Jitendra Malik
Abstract:
Deep learning for clinical applications is subject to stringent performance requirements, which raises a need for large labeled datasets. However, the enormous cost of labeling medical data makes this challenging. In this paper, we build a cost-sensitive active learning system for the problem of intracranial hemorrhage detection and segmentation on head computed tomography (CT). We show that our ensemble method compares favorably with the state-of-the-art, while running faster and using less memory. Moreover, our experiments are done using a substantially larger dataset than earlier papers on this topic. Since the labeling time could vary tremendously across examples, we model the labeling time and optimize the return on investment. We validate this idea by core-set selection on our large labeled dataset and by growing it with data from the wild.
Submitted 8 September, 2018;
originally announced September 2018.
-
Gibson Env: Real-World Perception for Embodied Agents
Authors:
Fei Xia,
Amir Zamir,
Zhi-Yang He,
Alexander Sax,
Jitendra Malik,
Silvio Savarese
Abstract:
Developing visual perception models for active agents and sensorimotor control is cumbersome in the physical world, as existing algorithms are too slow to efficiently learn in real-time and robots are fragile and costly. This has given rise to learning-in-simulation, which consequently raises the question of whether the results transfer to the real world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real-world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", enabling deploying the trained models in real-world without needing further domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space.
Submitted 31 August, 2018;
originally announced August 2018.
-
DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth
Authors:
Jameel Malik,
Ahmed Elhayek,
Fabrizio Nunnari,
Kiran Varanasi,
Kiarash Tamaddon,
Alexis Heloir,
Didier Stricker
Abstract:
Articulated hand pose and shape estimation is an important problem for vision-based applications such as augmented reality and animation. In contrast to the existing methods which optimize only for joint positions, we propose a fully supervised deep network which learns to jointly estimate a full 3D hand mesh representation and pose from a single depth image. To this end, a CNN architecture is employed to estimate parametric representations i.e. hand pose, bone scales and complex shape parameters. Then, a novel hand pose and shape layer, embedded inside our deep framework, produces 3D joint positions and hand mesh. Lack of sufficient training data with varying hand shapes limits the generalized performance of learning based methods. Also, manually annotating real data is suboptimal. Therefore, we present SynHand5M: a million-scale synthetic dataset with accurate joint annotations, segmentation masks and mesh files of depth maps. Among model based learning (hybrid) methods, we show improved results on our dataset and two of the public benchmarks i.e. NYU and ICVL. Also, by employing a joint training strategy with real and synthetic data, we recover 3D hand mesh and pose from real images in 3.7ms.
Submitted 28 August, 2018;
originally announced August 2018.
-
Sleep-wake classification via quantifying heart rate variability by convolutional neural network
Authors:
John Malik,
Yu-Lun Lo,
Hau-tieng Wu
Abstract:
Fluctuations in heart rate are intimately tied to changes in the physiological state of the organism. We examine and exploit this relationship by classifying a human subject's wake/sleep status using his instantaneous heart rate (IHR) series. We use a convolutional neural network (CNN) to build features from the IHR series extracted from a whole-night electrocardiogram (ECG) and predict every 30 seconds whether the subject is awake or asleep. Our training database consists of 56 normal subjects, and we consider three different databases for validation; one is private, and two are public with different races and apnea severities. On our private database of 27 subjects, our accuracy, sensitivity, specificity, and AUC values for predicting the wake stage are 83.1%, 52.4%, 89.4%, and 0.83, respectively. Validation performance is similar on our two public databases. When we use the photoplethysmography instead of the ECG to obtain the IHR series, the performance is also comparable. A robustness check is carried out to confirm the obtained performance statistics. This result advocates for an effective and scalable method for recognizing changes in physiological state using non-invasive heart rate monitoring. The CNN model adaptively quantifies IHR fluctuation as well as its location in time and is suitable for differentiating between the wake and sleep stages.
Submitted 31 July, 2018;
originally announced August 2018.
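The accuracy, sensitivity, and specificity figures quoted in the sleep-wake abstract above follow the standard confusion-matrix definitions. The sketch below shows those definitions, treating "wake" as the positive class (an assumption based on the phrase "for predicting the wake stage"). The function name `binary_metrics` and the epoch counts in the example are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    tp/fn: wake epochs predicted as wake/sleep (wake assumed positive).
    tn/fp: sleep epochs predicted as sleep/wake.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # fraction of wake epochs detected
        "specificity": tn / (tn + fp),  # fraction of sleep epochs detected
    }


# Hypothetical counts of 30-second epochs, for illustration only.
metrics = binary_metrics(tp=50, fp=10, tn=90, fn=50)
print(metrics)
```

A sensitivity well below the specificity, as in the reported 52.4% versus 89.4%, means wake epochs are missed more often than sleep epochs are mislabelled.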
-
On Evaluation of Embodied Navigation Agents
Authors:
Peter Anderson,
Angel Chang,
Devendra Singh Chaplot,
Alexey Dosovitskiy,
Saurabh Gupta,
Vladlen Koltun,
Jana Kosecka,
Jitendra Malik,
Roozbeh Mottaghi,
Manolis Savva,
Amir R. Zamir
Abstract:
Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.
Submitted 17 July, 2018;
originally announced July 2018.
-
Learning Instance Segmentation by Interaction
Authors:
Deepak Pathak,
Yide Shentu,
Dian Chen,
Pulkit Agrawal,
Trevor Darrell,
Sergey Levine,
Jitendra Malik
Abstract:
We present an approach for building an active agent that learns to segment its visual observations into individual objects by interacting with its environment in a completely self-supervised manner. The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels. The model learned from over 50K interactions generalizes to novel objects and backgrounds. To deal with the noisy training signal for segmenting objects obtained by self-supervised interactions, we propose a robust set loss. A dataset of the robot's interactions, along with a few human-labeled examples, is provided as a benchmark for future research. We test the utility of the learned segmentation model by providing results on a downstream vision-based control task of rearranging multiple objects into target configurations from visual inputs alone. Videos, code, and robotic interaction dataset are available at https://pathak22.github.io/seg-by-interaction/
Submitted 21 June, 2018;
originally announced June 2018.
-
PatchFCN for Intracranial Hemorrhage Detection
Authors:
Weicheng Kuo,
Christian Häne,
Esther Yuh,
Pratik Mukherjee,
Jitendra Malik
Abstract:
This paper studies the problem of detecting and segmenting acute intracranial hemorrhage on head computed tomography (CT) scans. We propose to solve both tasks as a semantic segmentation problem using a patch-based fully convolutional network (PatchFCN). This formulation allows us to accurately localize hemorrhages while bypassing the complexity of object detection. Our system demonstrates competitive performance with a human expert and the state-of-the-art on classification tasks (0.976, 0.966 AUC of ROC on retrospective and prospective test sets) and on segmentation tasks (0.785 pixel AP, 0.766 Dice score), while using much less data and a simpler system. In addition, we conduct a series of controlled experiments to understand "why" PatchFCN outperforms standard FCN. Our studies show that PatchFCN finds a good trade-off between batch diversity and the amount of context during training. These findings may also apply to other medical segmentation tasks.
Submitted 14 April, 2019; v1 submitted 8 June, 2018;
originally announced June 2018.
-
More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch
Authors:
Roberto Calandra,
Andrew Owens,
Dinesh Jayaraman,
Justin Lin,
Wenzhen Yuan,
Jitendra Malik,
Edward H. Adelson,
Sergey Levine
Abstract:
For humans, the process of grasping an object relies heavily on rich tactile feedback. Most recent robotic grasping work, however, has been based only on visual input, and thus cannot easily benefit from feedback after initiating contact. In this paper, we investigate how a robot can learn to use tactile information to iteratively and efficiently adjust its grasp. To this end, we propose an end-to-end action-conditional model that learns regrasping policies from raw visuo-tactile data. This model -- a deep, multimodal convolutional network -- predicts the outcome of a candidate grasp adjustment, and then executes a grasp by iteratively selecting the most promising actions. Our approach requires neither calibration of the tactile sensors, nor any analytical modeling of contact forces, thus reducing the engineering effort required to obtain efficient grasping policies. We train our model with data from about 6,450 grasping trials on a two-finger gripper equipped with GelSight high-resolution tactile sensors on each finger. Across extensive experiments, our approach outperforms a variety of baselines at (i) estimating grasp adjustment outcomes, (ii) selecting efficient grasp adjustments for quick grasping, and (iii) reducing the amount of force applied at the fingers, while maintaining competitive performance. Finally, we study the choices made by our model and show that it has successfully acquired useful and interpretable grasping behaviors.
Submitted 26 July, 2018; v1 submitted 28 May, 2018;
originally announced May 2018.
-
Colouring $(P_r+P_s)$-Free Graphs
Authors:
Tereza Klimošová,
Josef Malík,
Tomáš Masařík,
Jana Novotná,
Daniël Paulusma,
Veronika Slívová
Abstract:
The $k$-Colouring problem is to decide if the vertices of a graph can be coloured with at most $k$ colours for a fixed integer $k$ such that no two adjacent vertices are coloured alike. If each vertex u must be assigned a colour from a prescribed list $L(u) \subseteq \{1,\cdots, k\}$, then we obtain the List $k$-Colouring problem. A graph $G$ is $H$-free if $G$ does not contain $H$ as an induced s…
▽ More
The $k$-Colouring problem is to decide if the vertices of a graph can be coloured with at most $k$ colours for a fixed integer $k$ such that no two adjacent vertices are coloured alike. If each vertex u must be assigned a colour from a prescribed list $L(u) \subseteq \{1,\cdots, k\}$, then we obtain the List $k$-Colouring problem. A graph $G$ is $H$-free if $G$ does not contain $H$ as an induced subgraph. We continue an extensive study into the complexity of these two problems for $H$-free graphs. The graph $P_r+P_s$ is the disjoint union of the $r$-vertex path $P_r$ and the $s$-vertex path $P_s$. We prove that List $3$-Colouring is polynomial-time solvable for $(P_2+P_5)$-free graphs and for $(P_3+P_4)$-free graphs. Combining our results with known results yields complete complexity classifications of $3$-Colouring and List $3$-Colouring on $H$-free graphs for all graphs $H$ up to seven vertices.
Submitted 16 March, 2021; v1 submitted 30 April, 2018;
originally announced April 2018.
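To make the definitions in the colouring abstract above concrete, the following is a minimal sketch of the List $k$-Colouring decision problem and of the graph $P_r+P_s$. It is a plain exponential-time backtracking search, not the polynomial-time algorithm the paper proves to exist; the names `disjoint_paths` and `list_colouring` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Optional, Set

Graph = Dict[int, Set[int]]  # adjacency sets, vertices are integers


def disjoint_paths(r: int, s: int) -> Graph:
    """Build P_r + P_s: vertices 0..r-1 form one path, r..r+s-1 the other."""
    graph: Graph = {v: set() for v in range(r + s)}
    for start, length in ((0, r), (r, s)):
        for v in range(start, start + length - 1):
            graph[v].add(v + 1)
            graph[v + 1].add(v)
    return graph


def list_colouring(graph: Graph, lists: Dict[int, Set[int]]) -> Optional[Dict[int, int]]:
    """Return a proper colouring with colour(u) in lists[u], or None if none exists.

    Illustration only: simple backtracking, exponential in the worst case.
    """
    # Colour the most constrained vertices (shortest lists) first.
    order = sorted(graph, key=lambda v: len(lists[v]))
    colouring: Dict[int, int] = {}

    def extend(i: int) -> bool:
        if i == len(order):
            return True
        v = order[i]
        for c in sorted(lists[v]):
            # A colour is allowed if no already-coloured neighbour uses it.
            if all(colouring.get(u) != c for u in graph[v]):
                colouring[v] = c
                if extend(i + 1):
                    return True
                del colouring[v]
        return False

    return colouring if extend(0) else None
```

For example, `list_colouring(disjoint_paths(2, 5), {v: {1, 2, 3} for v in range(7)})` returns a colouring, whereas a triangle whose three vertices all have the list `{1, 2}` has none.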
-
Zero-Shot Visual Imitation
Authors:
Deepak Pathak,
Parsa Mahmoudieh,
Guanghao Luo,
Pulkit Agrawal,
Dian Chen,
Yide Shentu,
Evan Shelhamer,
Jitendra Malik,
Alexei A. Efros,
Trevor Darrell
Abstract:
The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at https://pathak22.github.io/zeroshot-imitation/
Submitted 23 April, 2018;
originally announced April 2018.
-
Taskonomy: Disentangling Task Transfer Learning
Authors:
Amir Zamir,
Alexander Sax,
William Shen,
Leonidas Guibas,
Jitendra Malik,
Silvio Savarese
Abstract:
Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity.
We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.
Submitted 23 April, 2018;
originally announced April 2018.
-
Connecting Dots -- from Local Covariance to Empirical Intrinsic Geometry and Locally Linear Embedding
Authors:
John Malik,
Chao Shen,
Hau-Tieng Wu,
Nan Wu
Abstract:
Local covariance structure under the manifold setup has been widely applied in the machine learning community. Based on the established theoretical results, we provide an extensive study of two relevant manifold learning algorithms, empirical intrinsic geometry (EIG) and locally linear embedding (LLE), under the manifold setup. In particular, we show that without an accurate dimension estimation, the geodesic distance estimation by EIG might be corrupted. Furthermore, we show that by taking the local covariance matrix into account, we can more accurately estimate the local geodesic distance. When understanding LLE based on the local covariance structure, its intimate relationship with the curvature suggests a variation of LLE depending on the "truncation scheme". We provide a theoretical analysis of the variation.
Submitted 8 February, 2019; v1 submitted 9 April, 2018;
originally announced April 2018.
-
Learning Category-Specific Mesh Reconstruction from Image Collections
Authors:
Angjoo Kanazawa,
Shubham Tulsiani,
Alexei A. Efros,
Jitendra Malik
Abstract:
We present a learning framework for recovering the 3D shape, camera, and texture of an object from a single image. The shape is represented as a deformable 3D mesh model of an object category where a shape is parameterized by a learned mean shape and per-instance predicted deformation. Our approach allows leveraging an annotated image collection for training, where the deformable model and the 3D prediction mechanism are learned without relying on ground-truth 3D or multi-view supervision. Our representation enables us to go beyond existing 3D prediction approaches by incorporating texture inference as prediction of an image in a canonical appearance space. Additionally, we show that semantic keypoints can be easily associated with the predicted shapes. We present qualitative and quantitative results of our approach on CUB and PASCAL3D datasets and show that we can learn to predict diverse shapes and textures across objects using only annotated image collections. The project website can be found at https://akanazawa.github.io/cmr/.
Submitted 30 July, 2018; v1 submitted 20 March, 2018;
originally announced March 2018.
-
Diffuse to fuse EEG spectra -- intrinsic geometry of sleep dynamics for classification
Authors:
Gi-Ren Liu,
Yu-Lun Lo,
John Malik,
Yuan-Chung Sheu,
Hau-tieng Wu
Abstract:
We propose a novel algorithm for sleep dynamics visualization and automatic annotation by applying a diffusion-geometry-based sensor fusion algorithm to fuse spectral information from two electroencephalograms (EEG). The diffusion geometry approach helps organize the nonlinear dynamical structure hidden in the EEG signal. The visualization is achieved by the nonlinear dimension reduction capability of the chosen diffusion geometry algorithms. For the automatic annotation purpose, a support vector machine is trained to predict the sleep stage. The prediction performance is validated on a publicly available benchmark database, Physionet Sleep-EDF [extended] SC$^*$ (SC = Sleep Cassette) and ST$^*$ (ST = Sleep Telemetry), with leave-one-subject-out cross validation. When we have a single EEG channel (Fpz-Cz), the overall accuracy, macro F1 and Cohen's kappa achieve $82.72\%$, $75.91\%$ and $76.1\%$ respectively in Sleep-EDF SC$^*$ and $78.63\%$, $73.58\%$ and $69.48\%$ in Sleep-EDF ST$^*$. This performance is comparable with the state-of-the-art results. When we have two EEG channels (Fpz-Cz and Pz-Oz), the overall accuracy, macro F1 and Cohen's kappa achieve $84.44\%$, $78.25\%$ and $78.36\%$ respectively in Sleep-EDF SC$^*$ and $79.05\%$, $74.73\%$ and $70.31\%$ in Sleep-EDF ST$^*$. The results suggest the potential of the proposed algorithm in practical applications.
Submitted 6 May, 2019; v1 submitted 28 February, 2018;
originally announced March 2018.
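The sleep-staging abstract above reports three summary metrics: overall accuracy, macro F1 and Cohen's kappa. As a reminder of how these are defined, here is a minimal sketch that computes them from two label sequences. The function name `classification_metrics` and the toy labels in the usage note are assumptions made for illustration; nothing here reproduces the paper's data or pipeline.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def classification_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Return (overall accuracy, macro F1, Cohen's kappa) for two label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    labels = set(y_true) | set(y_pred)

    # Overall accuracy: fraction of epochs whose predicted stage is correct.
    accuracy = sum(t == p for t, p in pairs) / n

    # Macro F1: unweighted mean of the per-class F1 scores.
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[l] * pred_counts[l] for l in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 1.0
    return accuracy, macro_f1, kappa
```

With the toy input `y_true = ["W", "W", "N1", "N1"]` and `y_pred = ["W", "N1", "N1", "N1"]`, accuracy is 0.75 while kappa is 0.5, which shows how kappa discounts agreement that would occur by chance.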
-
Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction
Authors:
Shubham Tulsiani,
Alexei A. Efros,
Jitendra Malik
Abstract:
We present a framework for learning single-view shape and pose prediction without using direct supervision for either. Our approach allows leveraging multi-view observations from unknown poses as supervisory signal during training. Our proposed training setup enforces geometric consistency between the independently predicted shape and pose from two views of the same instance. We consequently learn to predict shape in an emergent canonical (view-agnostic) frame along with a corresponding pose predictor. We show empirical and qualitative results using the ShapeNet dataset and observe encouragingly competitive performance to previous techniques which rely on stronger forms of supervision. We also demonstrate the applicability of our framework in a realistic setting which is beyond the scope of existing techniques: using a training dataset comprised of online product images where the underlying shape and pose are unknown.
Submitted 24 April, 2018; v1 submitted 11 January, 2018;
originally announced January 2018.
-
Unifying Map and Landmark Based Representations for Visual Navigation
Authors:
Saurabh Gupta,
David Fouhey,
Sergey Levine,
Jitendra Malik
Abstract:
This work presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.
Submitted 21 December, 2017;
originally announced December 2017.
-
End-to-end Recovery of Human Shape and Pose
Authors:
Angjoo Kanazawa,
Michael J. Black,
David W. Jacobs,
Jitendra Malik
Abstract:
We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using images in-the-wild that only have ground truth 2D annotations. However, the reprojection loss alone leaves the model highly underconstrained. In this work we address this problem by introducing an adversary trained to tell whether a human body parameter is real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any paired 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detections and infer 3D pose and shape parameters directly from image pixels. Our model runs in real-time given a bounding box containing the person. We demonstrate our approach on various images in-the-wild and out-perform previous optimization based methods that output 3D meshes and show competitive results on tasks such as 3D joint location estimation and part segmentation.
Submitted 23 June, 2018; v1 submitted 18 December, 2017;
originally announced December 2017.
-
Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image
Authors:
Jameel Malik,
Ahmed Elhayek,
Didier Stricker
Abstract:
Articulated hand pose estimation is a challenging task for human-computer interaction. The state-of-the-art hand pose estimation algorithms work only with one or a few subjects for which they have been calibrated or trained. Particularly, the hybrid methods based on learning followed by model fitting or model based deep learning do not explicitly consider varying hand shapes and sizes. In this work, we introduce a novel hybrid algorithm for estimating the 3D hand pose as well as bone-lengths of the hand skeleton at the same time, from a single depth image. The proposed CNN architecture learns hand pose parameters and scale parameters associated with the bone-lengths simultaneously. Subsequently, a new hybrid forward kinematics layer employs both parameters to estimate 3D joint positions of the hand. For end-to-end training, we combine three public datasets NYU, ICVL and MSRA-2015 in one unified format to achieve large variation in hand shapes and sizes. Among hybrid methods, our method shows improved accuracy over the state-of-the-art on the combined dataset and the ICVL dataset that contain multiple subjects. Also, our algorithm is demonstrated to work well with unseen images.
Submitted 8 December, 2017;
originally announced December 2017.
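The hand-pose abstract above describes a forward kinematics layer that combines pose parameters with scale parameters for the bone lengths to produce joint positions. The sketch below shows that idea in its simplest possible form: a single planar finger modelled as a chain of bones. It is a heavily simplified assumption made for illustration (2D instead of 3D, one chain instead of a full hand skeleton, no learning), and the name `finger_joint_positions` is hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def finger_joint_positions(
    base_lengths: Sequence[float],
    scales: Sequence[float],
    angles: Sequence[float],
) -> List[Tuple[float, float]]:
    """Planar forward kinematics for one finger.

    base_lengths: reference bone lengths of the chain.
    scales:       per-bone scale factors (stand-in for the scale parameters).
    angles:       relative joint angles in radians (stand-in for the pose parameters).
    Returns the 2D position of the root followed by each joint along the chain.
    """
    if not (len(base_lengths) == len(scales) == len(angles)):
        raise ValueError("all parameter sequences must have the same length")
    x = y = heading = 0.0
    positions = [(x, y)]
    for length, scale, angle in zip(base_lengths, scales, angles):
        heading += angle  # joint angles accumulate along the chain
        x += scale * length * math.cos(heading)
        y += scale * length * math.sin(heading)
        positions.append((x, y))
    return positions
```

Because every step is built from differentiable operations (sums, products, sine and cosine), a layer of this kind can sit at the end of a network and pass gradients back to both the angle and the scale predictions, which is the property an end-to-end trained model relies on.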
-
From Lifestyle Vlogs to Everyday Interactions
Authors:
David F. Fouhey,
Wei-cheng Kuo,
Alexei A. Efros,
Jitendra Malik
Abstract:
A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start with a large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data. We show that by collecting the data first, we are able to achieve greater scale and far greater diversity in terms of actions and actors. Additionally, our data exposes biases built into common explicitly gathered data. We make sense of our data by analyzing the central component of interaction -- hands. We benchmark two tasks: identifying semantic object contact at the video level and non-semantic contact state at the frame level. We additionally demonstrate future prediction of hands.
Submitted 6 December, 2017;
originally announced December 2017.
-
Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene
Authors:
Shubham Tulsiani,
Saurabh Gupta,
David Fouhey,
Alexei A. Efros,
Jitendra Malik
Abstract:
The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.
Submitted 24 April, 2018; v1 submitted 5 December, 2017;
originally announced December 2017.
-
Generic 3D Representation via Pose Estimation and Matching
Authors:
Amir R. Zamir,
Tilman Wekel,
Pulkit Argrawal,
Colin Weil,
Jitendra Malik,
Silvio Savarese
Abstract:
Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose estimation). In the context of the core supervised tasks, we demonstrate our representation achieves state-of-the-art wide baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learned features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy of both supervised tasks is comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.
Submitted 23 October, 2017;
originally announced October 2017.
-
Large-Scale 3D Shape Reconstruction and Segmentation from ShapeNet Core55
Authors:
Li Yi,
Lin Shao,
Manolis Savva,
Haibin Huang,
Yang Zhou,
Qirui Wang,
Benjamin Graham,
Martin Engelcke,
Roman Klokov,
Victor Lempitsky,
Yuan Gan,
Pengyu Wang,
Kun Liu,
Fenggen Yu,
Panpan Shui,
Bingyang Hu,
Yan Zhang,
Yangyan Li,
Rui Bu,
Mingchao Sun,
Wei Wu,
Minki Jeong,
Jaehoon Choi,
Changick Kim,
Angom Geetchandra
, et al. (25 additional authors not shown)
Abstract:
We introduce a large-scale 3D shape understanding benchmark using data and annotation from ShapeNet 3D object database. The benchmark consists of two tasks: part-level segmentation of 3D shapes and 3D reconstruction from single view images. Ten teams have participated in the challenge and the best performing teams have outperformed state-of-the-art approaches on both tasks. A few novel deep learning architectures have been proposed on various 3D representations on both tasks. We report the techniques used by each team and the corresponding performances. In addition, we summarize the major discoveries from the reported results and possible trends for the future work in the field.
Submitted 27 October, 2017; v1 submitted 17 October, 2017;
originally announced October 2017.
-
Extended Phase Graph formalism for systems with Magnetization Transfer and Chemical Exchange
Authors:
Shaihan J. Malik,
Rui P. A. G. Teixeira,
Joseph V. Hajnal
Abstract:
An Extended Phase Graph framework for modelling systems with exchange or magnetization transfer (MT) is proposed. The framework, referred to as EPG-X, models coupled two-compartment systems by describing each compartment with separate phase graphs that exchange during evolution periods. There are two variants: EPG-X(BM) for systems governed by the Bloch-McConnell equations; and EPG-X(MT) for the pulsed MT formalism. For the MT case the "bound" protons have no transverse components so their phase graph consists only of longitudinal states. EPG-X was used to model steady-state gradient echo imaging, MT effects in multislice Turbo Spin Echo imaging, multiecho CPMG for multicomponent T2 relaxometry and transient variable flip angle gradient echo imaging of the type used for MR Fingerprinting. Experimental data were also collected for the final case. Steady-state predictions from EPG-X closely match directly derived steady-state solutions, which differ substantially from classic "single pool" EPG predictions. EPG-X(MT) predicts similar MT related levels of signal attenuation in white matter as have been reported elsewhere in the literature. Modelling of CPMG echo trains with EPG-X(BM) suggests that exchange processes can lead to an underestimate of the fraction of short T2 species. Modelling of transient gradient echo sequences with EPG-X(MT) suggests that measurable MT effects result from variable saturation of bound protons, particularly after inversion pulses. In conclusion, EPG-X can be used for modelling of the transient signal response of systems exhibiting chemical exchange or MT. This may be particularly beneficial for relaxometry approaches that rely on characterising transient rather than steady-state sequences.
Submitted 4 September, 2017;
originally announced September 2017.
-
Learning a Multi-View Stereo Machine
Authors:
Abhishek Kar,
Christian Häne,
Jitendra Malik
Abstract:
We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from far fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches as well as recent learning based methods.
Submitted 17 August, 2017;
originally announced August 2017.
-
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Authors:
Chunhui Gu,
Chen Sun,
David A. Ross,
Carl Vondrick,
Caroline Pantofaru,
Yeqing Li,
Sudheendra Vijayanarasimhan,
George Toderici,
Susanna Ricco,
Rahul Sukthankar,
Cordelia Schmid,
Jitendra Malik
Abstract:
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
Submitted 30 April, 2018; v1 submitted 23 May, 2017;
originally announced May 2017.
-
Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency
Authors:
Shubham Tulsiani,
Tinghui Zhou,
Alexei A. Efros,
Jitendra Malik
Abstract:
We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations e.g. foreground masks, depth, color images, semantics etc. as supervision for learning single-view 3D prediction. We present empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset.
Submitted 20 April, 2017;
originally announced April 2017.
-
Hierarchical Surface Prediction for 3D Object Reconstruction
Authors:
Christian Häne,
Shubham Tulsiani,
Jitendra Malik
Abstract:
Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as a single color image. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. Our approach is not dependent on a specific input type. We show results for geometry prediction from color images, depth images and shape completion from partial voxel grids. Our analysis shows that our high resolution predictions are more accurate than low resolution predictions.
Submitted 6 November, 2017; v1 submitted 3 April, 2017;
originally announced April 2017.
-
Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation
Authors:
Ashvin Nair,
Dian Chen,
Pulkit Agrawal,
Phillip Isola,
Pieter Abbeel,
Jitendra Malik,
Sergey Levine
Abstract:
Manipulation of deformable objects, such as ropes and cloth, is an important but challenging problem in robotics. We present a learning-based system where a robot takes as input a sequence of images of a human manipulating a rope from an initial to goal configuration, and outputs a sequence of actions that can reproduce the human demonstration, using only monocular images as input. To perform this task, the robot learns a pixel-level inverse dynamics model of rope manipulation directly from images in a self-supervised manner, using about 60K interactions with the rope collected autonomously by the robot. The human demonstration provides a high-level plan of what to do and the low-level inverse model is used to execute the plan. We show that by combining the high and low-level plans, the robot can successfully manipulate a rope into a variety of target shapes using only a sequence of human-provided images for direction.
Submitted 6 March, 2017;
originally announced March 2017.
-
Learning to Optimize Neural Nets
Authors:
Ke Li,
Jitendra Malik
Abstract:
Learning to Optimize is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.
Submitted 30 November, 2017; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Fast k-Nearest Neighbour Search via Prioritized DCI
Authors:
Ke Li,
Jitendra Malik
Abstract:
Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21.
Submitted 20 July, 2017; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Single-lead f-wave extraction using diffusion geometry
Authors:
John Malik,
Neil Reed,
Chun-Li Wang,
Hautieng Wu
Abstract:
A novel single-lead f-wave extraction algorithm based on the modern diffusion geometry data analysis framework is proposed. The algorithm is essentially an averaged beat subtraction algorithm, where the ventricular activity template is estimated by combining a newly designed metric, the "diffusion distance," and the non-local Euclidean median based on the non-linear manifold setup. We coined the algorithm DD-NLEM. Two simulation schemes are considered, and the new algorithm DD-NLEM outperforms traditional algorithms, including the average beat subtraction, principal component analysis, and adaptive singular value cancellation, in different evaluation metrics with statistical significance. The clinical potential is shown in the real Holter signal, and we introduce a new score to evaluate the performance of the algorithm.
Submitted 28 April, 2017; v1 submitted 27 February, 2017;
originally announced February 2017.
-
Cognitive Mapping and Planning for Visual Navigation
Authors:
Saurabh Gupta,
Varun Tolani,
James Davidson,
Sergey Levine,
Rahul Sukthankar,
Jitendra Malik
Abstract:
We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real world buildings. Our experiments demonstrate that CMP outperforms alternate learning-based architectures, as well as classical mapping and path planning approaches, in many cases. Furthermore, it naturally extends to semantically specified goals, such as 'going to a chair'. We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.
Submitted 7 February, 2019; v1 submitted 13 February, 2017;
originally announced February 2017.
-
Feedback Networks
Authors:
Amir R. Zamir,
Te-Lin Wu,
Lin Sun,
William Shen,
Jitendra Malik,
Silvio Savarese
Abstract:
Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on feedback received from the previous iteration's output.
We establish that a feedback based approach has several fundamental advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback networks develop a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We put forth a general feedback based learning architecture with the endpoint results on par or better than existing feedforward networks with the addition of the above advantages. We also investigate several mechanisms in feedback architectures (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives in quest for more natural and practical learning models.
Submitted 20 August, 2017; v1 submitted 30 December, 2016;
originally announced December 2016.
-
Beyond Skip Connections: Top-Down Modulation for Object Detection
Authors:
Abhinav Shrivastava,
Rahul Sukthankar,
Jitendra Malik,
Abhinav Gupta
Abstract:
In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from low-level requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO testdev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 for InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).
Submitted 19 September, 2017; v1 submitted 20 December, 2016;
originally announced December 2016.
-
Learning Shape Abstractions by Assembling Volumetric Primitives
Authors:
Shubham Tulsiani,
Hao Su,
Leonidas J. Guibas,
Alexei A. Efros,
Jitendra Malik
Abstract:
We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives. In addition to generating simple and geometrically interpretable explanations of 3D objects, our framework also allows us to automatically discover and exploit consistent structure in the data. We demonstrate that using our method allows predicting shape representations which can be leveraged for obtaining a consistent parsing across the instances of a shape collection and constructing an interpretable shape similarity measure. We also examine applications for image-based prediction as well as shape manipulation.
Submitted 2 August, 2018; v1 submitted 1 December, 2016;
originally announced December 2016.
-
Learning to Poke by Poking: Experiential Learning of Intuitive Physics
Authors:
Pulkit Agrawal,
Ashvin Nair,
Pieter Abbeel,
Jitendra Malik,
Sergey Levine
Abstract:
We investigate an experiential learning paradigm for acquiring an internal model of intuitive physics. Our model is evaluated on a real-world robotic manipulation task that requires displacing objects to target locations by poking. The robot gathered over 400 hours of experience by executing more than 100K pokes on different objects. We propose a novel approach based on deep neural networks for modeling the dynamics of robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics. The inverse model objective provides supervision to construct informative visual features, which the forward model can then predict and in turn regularize the feature space for the inverse model. The interplay between these two objectives creates useful, accurate models that can then be used for multi-step decision making. This formulation has the additional benefit that it is possible to learn forward models in an abstract feature space and thus alleviate the need of predicting pixels. Our experiments show that this joint modeling approach outperforms alternative methods.
Submitted 15 February, 2017; v1 submitted 23 June, 2016;
originally announced June 2016.
-
Learning to Optimize
Authors:
Ke Li,
Jitendra Malik
Abstract:
Algorithm design is a laborious process and often requires many iterations of ideation and validation. In this paper, we explore automating algorithm design and present a method to learn an optimization algorithm, which we believe to be the first method that can automatically discover a better algorithm. We approach this problem from a reinforcement learning perspective and represent any particular optimization algorithm as a policy. We learn an optimization algorithm using guided policy search and demonstrate that the resulting algorithm outperforms existing hand-engineered algorithms in terms of convergence speed and/or the final objective value.
Submitted 6 June, 2016;
originally announced June 2016.
-
View Synthesis by Appearance Flow
Authors:
Tinghui Zhou,
Shubham Tulsiani,
Weilun Sun,
Jitendra Malik,
Alexei A. Efros
Abstract:
We address the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints. We approach this as a learning task but, critically, instead of learning to synthesize pixels from scratch, we learn to copy them from the input image. Our approach exploits the observation that the visual appearance of different views of the same instance is highly correlated, and such correlation could be explicitly learned by training a convolutional neural network (CNN) to predict appearance flows -- 2-D coordinate vectors specifying which pixels in the input view could be used to reconstruct the target view. Furthermore, the proposed framework easily generalizes to multiple input views by learning how to optimally combine single-view predictions. We show that for both objects and scenes, our approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.
Submitted 11 February, 2017; v1 submitted 11 May, 2016;
originally announced May 2016.
-
Amodal Instance Segmentation
Authors:
Ke Li,
Jitendra Malik
Abstract:
We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method's effectiveness both qualitatively and quantitatively.
Submitted 17 August, 2016; v1 submitted 27 April, 2016;
originally announced April 2016.