SlowFast Networks for Video Recognition

Authors: Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

Abstract: We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/facebookresearch/SlowFast

Submitted 29 October, 2019; v1 submitted 10 December, 2018; originally announced December 2018.

Comments: Technical report

arXiv:1812.02249 [pdf, other]

doi 10.1002/mrm.27798

Fetal whole-heart 4D imaging using motion-corrected multi-planar real-time MRI

Authors: Joshua FP van Amerom, David FA Lloyd, Maria Deprez, Anthony N Price, Shaihan J Malik, Kuberan Pushparajah, Milou PM van Poppel, Mary A Rutherford, Reza Razavi, Joseph V Hajnal

Abstract: Purpose: To develop an MRI acquisition and reconstruction framework for volumetric cine visualisation of the fetal heart and great vessels in the presence of maternal and fetal motion. Methods: Four-dimensional depiction was achieved using a highly-accelerated multi-planar real-time balanced steady state free precession acquisition combined with retrospective image-domain techniques for motion correction, cardiac synchronisation and outlier rejection. The framework was evaluated and optimised using a numerical phantom, and evaluated in a study of 20 mid- to late-gestational age human fetal subjects. Reconstructed cine volumes were evaluated by experienced cardiologists and compared with matched ultrasound. A preliminary assessment of flow-sensitive reconstruction using the velocity information encoded in the phase of dynamic images is included. Results: Reconstructed cine volumes could be visualised in any 2D plane without the need for highly-specific scan plane prescription prior to acquisition or for maternal breath hold to minimise motion. Reconstruction was fully automated aside from user-specified masks of the fetal heart and chest. The framework proved robust when applied to fetal data and simulations confirmed that spatial and temporal features could be reliably recovered. Expert evaluation suggested the reconstructed volumes can be used for comprehensive assessment of the fetal heart, either as an adjunct to ultrasound or in combination with other MRI techniques. Conclusion: The proposed methods show promise as a framework for motion-compensated 4D assessment of the fetal heart and great vessels.

Submitted 5 December, 2018; originally announced December 2018.

arXiv:1812.01601 [pdf, other]

Learning 3D Human Dynamics from Video

Authors: Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, Jitendra Malik

Abstract: From an image of a person in action, we can easily guess the 3D motion of the person in the immediate past and future. This is because we have a mental model of 3D human dynamics that we have acquired from observing visual sequences of humans in motion. We present a framework that can similarly learn a representation of 3D dynamics of humans from video via a simple but effective temporal encoding of image features. At test time, from video, the learned temporal representation gives rise to smooth 3D mesh predictions. From a single image, our model can recover the current 3D mesh as well as its 3D past and future motion. Our approach is designed so it can learn from videos with 2D pose annotations in a semi-supervised manner. Though annotated data is always limited, there are millions of videos uploaded daily on the Internet. In this work, we harvest this Internet-scale source of unlabeled data by training our model on unlabeled video with pseudo-ground truth 2D pose obtained from an off-the-shelf 2D pose detector. Our experiments show that adding more videos with pseudo-ground truth 2D pose monotonically improves 3D prediction performance. We evaluate our model, Human Mesh and Motion Recovery (HMMR), on the recent challenging dataset of 3D Poses in the Wild and obtain state-of-the-art performance on the 3D prediction task without any fine-tuning. The project website with video, code, and data can be found at https://akanazawa.github.io/human_dynamics/.

Submitted 16 September, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: To appear in CVPR 2019. Changelog: v3. +an experiment to compare improvement from pseudo-gt data on single view vs temporal context model. v2. camready ver: Minor update in model training where the gaussian shape prior is used, updated results (similar results, same trends), added more ablation study in the appendix. v1. +evaluation protocol subsection in appendix, updated results due to bug fix

arXiv:1812.00940 [pdf, other]

Visual Memory for Robust Path Following

Authors: Ashish Kumar, Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik

Abstract: Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches and other learning based baselines.

Submitted 3 December, 2018; originally announced December 2018.

Comments: Neural Information Processing Systems (NeurIPS) 2018. Oral Presentation

arXiv:1811.12569 [pdf, other]

Are All Training Examples Created Equal? An Empirical Study

Authors: Kailas Vodrahalli, Ke Li, Jitendra Malik

Abstract: Modern computer vision algorithms often rely on very large training datasets. However, it is conceivable that a carefully selected subsample of the dataset is sufficient for training. In this paper, we propose a gradient-based importance measure that we use to empirically analyze relative importance of training images in four datasets of varying complexity. We find that in some cases, a small subsample is indeed sufficient for training. For other datasets, however, the relative differences in importance are negligible. These results have important implications for active learning on deep networks. Additionally, our analysis method can be used as a general tool to better understand diversity of training examples in datasets.

Submitted 29 November, 2018; originally announced November 2018.

Comments: 12 pages, 12 figures

arXiv:1811.12402 [pdf, ps, other]

On the Implicit Assumptions of GANs

Authors: Ke Li, Jitendra Malik

Abstract: Generative adversarial nets (GANs) have generated a lot of excitement. Despite their popularity, they exhibit a number of well-documented issues in practice, which apparently contradict theoretical guarantees. A number of enlightening papers have pointed out that these issues arise from unjustified assumptions that are commonly made, but the message seems to have been lost amid the optimism of recent years. We believe the identified problems deserve more attention, and highlight the implications on both the properties of GANs and the trajectory of research on probabilistic models. We recently proposed an alternative method that sidesteps these problems.

Submitted 29 November, 2018; originally announced November 2018.

Comments: 8 pages

arXiv:1811.12373 [pdf, other]

Diverse Image Synthesis from Semantic Layouts via Conditional IMLE

Authors: Ke Li, Tianhao Zhang, Jitendra Malik

Abstract: Most existing methods for conditional image synthesis are only able to generate a single plausible image for any given input, or at best a fixed number of plausible images. In this paper, we focus on the problem of generating images from semantic segmentation maps and present a simple new method that can generate an arbitrary number of images with diverse appearance for the same semantic layout. Unlike most existing approaches which adopt the GAN framework, our method is based on the recently introduced Implicit Maximum Likelihood Estimation (IMLE) framework. Compared to the leading approach, our method is able to generate more diverse images while producing fewer artifacts despite using the same architecture. The learned latent space also has sensible structure despite the lack of supervision that encourages such behaviour. Videos and code are available at https://people.eecs.berkeley.edu/~ke.li/projects/imle/scene_layouts/.

Submitted 29 August, 2019; v1 submitted 29 November, 2018; originally announced November 2018.

Comments: 18 pages, 16 figures; IEEE International Conference on Computer Vision (ICCV), 2019

arXiv:1811.11074 [pdf, other]

Recycling cardiogenic artifacts in impedance pneumography

Authors: Yao Lu, Hau-tieng Wu, John Malik

Abstract: Purpose: Biomedical sensors often exhibit cardiogenic artifacts which, while distorting the signal of interest, carry useful hemodynamic information. We propose an algorithm to remove and extract hemodynamic information from these cardiogenic artifacts. Methods: We apply a nonlinear time-frequency analysis technique, the de-shape synchrosqueezing transform (dsSST), to adaptively isolate the high- and low-frequency components of a single-channel signal. We demonstrate this technique's effectiveness by removing and deriving hemodynamic information from the cardiogenic artifact in an impedance pneumography (IP). Results: The instantaneous heart rate is extracted, and the cardiac and respiratory signals are reconstructed. Conclusions: The dsSST is suitable for generating useful hemodynamic information from the cardiogenic artifact in a single-channel IP. We propose that the usefulness of the dsSST as a recycling tool extends to other biomedical sensors exhibiting cardiogenic artifacts.

Submitted 26 February, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

Comments: 21 pages, 6 figures

arXiv:1810.03599 [pdf, other]

SFV: Reinforcement Learning of Physical Skills from Videos

Authors: Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, Sergey Levine

Abstract: Data-driven character animation based on motion capture can produce highly naturalistic behaviors and, when combined with physics simulation, can provide for natural procedural responses to physical perturbations, environmental changes, and morphological discrepancies. Motion capture remains the most popular source of motion data, but collecting mocap data typically requires heavily instrumented environments and actors. In this paper, we propose a method that enables physically simulated characters to learn skills from videos (SFV). Our approach, based on deep pose estimation and deep reinforcement learning, allows data-driven animation to leverage the abundance of publicly available video clips from the web, such as those from YouTube. This has the potential to enable fast and easy design of character controllers simply by querying for video recordings of the desired behavior. The resulting controllers are robust to perturbations, can be adapted to new settings, can perform basic object interactions, and can be retargeted to new morphologies via reinforcement learning. We further demonstrate that our method can predict potential human motions from still images, by forward simulation of learned controllers initialized from the observed pose. Our framework is able to learn a broad range of dynamic skills, including locomotion, acrobatics, and martial arts.

Submitted 15 October, 2018; v1 submitted 8 October, 2018; originally announced October 2018.

arXiv:1810.01406 [pdf, other]

Super-Resolution via Conditional Implicit Maximum Likelihood Estimation

Authors: Ke Li, Shichong Peng, Jitendra Malik

Abstract: Single-image super-resolution (SISR) is a canonical problem with diverse applications. Leading methods like SRGAN produce images that contain various artifacts, such as high-frequency noise, hallucinated colours and shape distortions, which adversely affect the realism of the result. In this paper, we propose an alternative approach based on an extension of the method of Implicit Maximum Likelihood Estimation (IMLE). We demonstrate greater effectiveness at noise reduction and preservation of the original colours and shapes, yielding more realistic super-resolved images.

Submitted 2 October, 2018; originally announced October 2018.

Comments: 12 pages, 7 figures

arXiv:1809.09087 [pdf, other]

Implicit Maximum Likelihood Estimation

Authors: Ke Li, Jitendra Malik

Abstract: Implicit probabilistic models are models defined naturally in terms of a sampling procedure and often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, but can be shown to be equivalent to maximizing likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.

Submitted 22 October, 2018; v1 submitted 24 September, 2018; originally announced September 2018.

Comments: 21 pages, 4 figures. In the interest of promoting discussion, we make the reviews available at https://people.eecs.berkeley.edu/~ke.li/papers/imle_reviews.pdf

arXiv:1809.02882 [pdf, other]

Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection

Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

Abstract: Deep learning for clinical applications is subject to stringent performance requirements, which raises a need for large labeled datasets. However, the enormous cost of labeling medical data makes this challenging. In this paper, we build a cost-sensitive active learning system for the problem of intracranial hemorrhage detection and segmentation on head computed tomography (CT). We show that our ensemble method compares favorably with the state-of-the-art, while running faster and using less memory. Moreover, our experiments are done using a substantially larger dataset than earlier papers on this topic. Since the labeling time could vary tremendously across examples, we model the labeling time and optimize the return on investment. We validate this idea by core-set selection on our large labeled dataset and by growing it with data from the wild.

Submitted 8 September, 2018; originally announced September 2018.

arXiv:1808.10654 [pdf, other]

Gibson Env: Real-World Perception for Embodied Agents

Authors: Fei Xia, Amir Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese

Abstract: Developing visual perception models for active agents and sensorimotor control are cumbersome to be done in the physical world, as existing algorithms are too slow to efficiently learn in real-time and robots are fragile and costly. This has given rise to learning-in-simulation which consequently casts a question on whether the results transfer to real-world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real-world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", enabling deploying the trained models in real-world without needing further domain adaptation, III. embodiment of agents and making them subject to constraints of physics and space.

Submitted 31 August, 2018; originally announced August 2018.

Comments: Access the code, dataset, and project website at http://gibsonenv.vision/ . CVPR 2018

Journal ref: CVPR 2018

arXiv:1808.09208 [pdf, other]

DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth

Authors: Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamaddon, Alexis Heloir, Didier Stricker

Abstract: Articulated hand pose and shape estimation is an important problem for vision-based applications such as augmented reality and animation. In contrast to the existing methods which optimize only for joint positions, we propose a fully supervised deep network which learns to jointly estimate a full 3D hand mesh representation and pose from a single depth image. To this end, a CNN architecture is employed to estimate parametric representations i.e. hand pose, bone scales and complex shape parameters. Then, a novel hand pose and shape layer, embedded inside our deep framework, produces 3D joint positions and hand mesh. Lack of sufficient training data with varying hand shapes limits the generalized performance of learning based methods. Also, manually annotating real data is suboptimal. Therefore, we present SynHand5M: a million-scale synthetic dataset with accurate joint annotations, segmentation masks and mesh files of depth maps. Among model based learning (hybrid) methods, we show improved results on our dataset and two of the public benchmarks i.e. NYU and ICVL. Also, by employing a joint training strategy with real and synthetic data, we recover 3D hand mesh and pose from real images in 3.7ms.

Submitted 28 August, 2018; originally announced August 2018.

Comments: Accepted for publication in 3DV-2018 (http://3dv18.uniud.it/)

arXiv:1808.00142 [pdf, other]

doi 10.1088/1361-6579/aad5a9

Sleep-wake classification via quantifying heart rate variability by convolutional neural network

Authors: John Malik, Yu-Lun Lo, Hau-tieng Wu

Abstract: Fluctuations in heart rate are intimately tied to changes in the physiological state of the organism. We examine and exploit this relationship by classifying a human subject's wake/sleep status using his instantaneous heart rate (IHR) series. We use a convolutional neural network (CNN) to build features from the IHR series extracted from a whole-night electrocardiogram (ECG) and predict every 30 seconds whether the subject is awake or asleep. Our training database consists of 56 normal subjects, and we consider three different databases for validation; one is private, and two are public with different races and apnea severities. On our private database of 27 subjects, our accuracy, sensitivity, specificity, and AUC values for predicting the wake stage are 83.1%, 52.4%, 89.4%, and 0.83, respectively. Validation performance is similar on our two public databases. When we use the photoplethysmography instead of the ECG to obtain the IHR series, the performance is also comparable. A robustness check is carried out to confirm the obtained performance statistics. This result advocates for an effective and scalable method for recognizing changes in physiological state using non-invasive heart rate monitoring. The CNN model adaptively quantifies IHR fluctuation as well as its location in time and is suitable for differentiating between the wake and sleep stages.

Submitted 31 July, 2018; originally announced August 2018.

arXiv:1807.06757 [pdf, other]

On Evaluation of Embodied Navigation Agents

Authors: Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, Amir R. Zamir

Abstract: Skillful mobile operation in three-dimensional environments is a primary topic of study in Artificial Intelligence. The past two years have seen a surge of creative work on navigation. This creative output has produced a plethora of sometimes incompatible task definitions and evaluation protocols. To coordinate ongoing and future research in this area, we have convened a working group to study empirical methodology in navigation research. The present document summarizes the consensus recommendations of this working group. We discuss different problem statements and the role of generalization, present evaluation measures, and provide standard scenarios that can be used for benchmarking.

Submitted 17 July, 2018; originally announced July 2018.

Comments: Report of a working group on empirical methodology in navigation research. Authors are listed in alphabetical order

arXiv:1806.08354 [pdf, other]

Learning Instance Segmentation by Interaction

Authors: Deepak Pathak, Yide Shentu, Dian Chen, Pulkit Agrawal, Trevor Darrell, Sergey Levine, Jitendra Malik

Abstract: We present an approach for building an active agent that learns to segment its visual observations into individual objects by interacting with its environment in a completely self-supervised manner. The agent uses its current segmentation model to infer pixels that constitute objects and refines the segmentation model by interacting with these pixels. The model learned from over 50K interactions generalizes to novel objects and backgrounds. To deal with noisy training signal for segmenting objects obtained by self-supervised interactions, we propose robust set loss. A dataset of robot's interactions along with a few human labeled examples is provided as a benchmark for future research. We test the utility of the learned segmentation model by providing results on a downstream vision-based control task of rearranging multiple objects into target configurations from visual inputs alone. Videos, code, and robotic interaction dataset are available at https://pathak22.github.io/seg-by-interaction/

Submitted 21 June, 2018; originally announced June 2018.

Comments: Website at https://pathak22.github.io/seg-by-interaction/

arXiv:1806.03265 [pdf, other]

PatchFCN for Intracranial Hemorrhage Detection

Authors: Weicheng Kuo, Christian Häne, Esther Yuh, Pratik Mukherjee, Jitendra Malik

Abstract: This paper studies the problem of detecting and segmenting acute intracranial hemorrhage on head computed tomography (CT) scans. We propose to solve both tasks as a semantic segmentation problem using a patch-based fully convolutional network (PatchFCN). This formulation allows us to accurately localize hemorrhages while bypassing the complexity of object detection. Our system demonstrates competitive performance with a human expert and the state-of-the-art on classification tasks (0.976, 0.966 AUC of ROC on retrospective and prospective test sets) and on segmentation tasks (0.785 pixel AP, 0.766 Dice score), while using much less data and a simpler system. In addition, we conduct a series of controlled experiments to understand "why" PatchFCN outperforms standard FCN. Our studies show that PatchFCN finds a good trade-off between batch diversity and the amount of context during training. These findings may also apply to other medical segmentation tasks.

Submitted 14 April, 2019; v1 submitted 8 June, 2018; originally announced June 2018.

arXiv:1805.11085 [pdf, other]

doi 10.1109/LRA.2018.2852779

More Than a Feeling: Learning to Grasp and Regrasp using Vision and Touch

Authors: Roberto Calandra, Andrew Owens, Dinesh Jayaraman, Justin Lin, Wenzhen Yuan, Jitendra Malik, Edward H. Adelson, Sergey Levine

Abstract: For humans, the process of grasping an object relies heavily on rich tactile feedback. Most recent robotic grasping work, however, has been based only on visual input, and thus cannot easily benefit from feedback after initiating contact. In this paper, we investigate how a robot can learn to use tactile information to iteratively and efficiently adjust its grasp. To this end, we propose an end-to-end action-conditional model that learns regrasping policies from raw visuo-tactile data. This model -- a deep, multimodal convolutional network -- predicts the outcome of a candidate grasp adjustment, and then executes a grasp by iteratively selecting the most promising actions. Our approach requires neither calibration of the tactile sensors, nor any analytical modeling of contact forces, thus reducing the engineering effort required to obtain efficient grasping policies. We train our model with data from about 6,450 grasping trials on a two-finger gripper equipped with GelSight high-resolution tactile sensors on each finger. Across extensive experiments, our approach outperforms a variety of baselines at (i) estimating grasp adjustment outcomes, (ii) selecting efficient grasp adjustments for quick grasping, and (iii) reducing the amount of force applied at the fingers, while maintaining competitive performance. Finally, we study the choices made by our model and show that it has successfully acquired useful and interpretable grasping behaviors.

Submitted 26 July, 2018; v1 submitted 28 May, 2018; originally announced May 2018.

Comments: 8 pages. Published in IEEE Robotics and Automation Letters (RAL). Website: https://sites.google.com/view/more-than-a-feeling

arXiv:1804.11091 [pdf, other]

doi 10.1007/s00453-020-00675-w

Colouring $(P_r+P_s)$-Free Graphs

Authors: Tereza Klimošová, Josef Malík, Tomáš Masařík, Jana Novotná, Daniël Paulusma, Veronika Slívová

Abstract: The $k$-Colouring problem is to decide if the vertices of a graph can be coloured with at most $k$ colours for a fixed integer $k$ such that no two adjacent vertices are coloured alike. If each vertex $u$ must be assigned a colour from a prescribed list $L(u) \subseteq \{1,\cdots, k\}$, then we obtain the List $k$-Colouring problem. A graph $G$ is $H$-free if $G$ does not contain $H$ as an induced subgraph. We continue an extensive study into the complexity of these two problems for $H$-free graphs. The graph $P_r+P_s$ is the disjoint union of the $r$-vertex path $P_r$ and the $s$-vertex path $P_s$. We prove that List $3$-Colouring is polynomial-time solvable for $(P_2+P_5)$-free graphs and for $(P_3+P_4)$-free graphs. Combining our results with known results yields complete complexity classifications of $3$-Colouring and List $3$-Colouring on $H$-free graphs for all graphs $H$ up to seven vertices.

Submitted 16 March, 2021; v1 submitted 30 April, 2018; originally announced April 2018.

Comments: 20 pages, 6 figures. An extended abstract of this paper appeared in the proceedings of ISAAC 2018

Journal ref: Algorithmica 82(7), 1833-1858 (2020)
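
Illustrative note (not from the paper): the List $k$-Colouring problem defined in the abstract above can be made concrete with a small brute-force checker. This is only a sketch for tiny graphs; it runs in exponential time and is not the polynomial-time algorithm the authors prove to exist. The function names and the example graph are hypothetical.

```python
from itertools import product


def is_valid_list_colouring(edges, lists, colouring):
    """True if each vertex uses a colour from its list and no edge is monochromatic."""
    if any(colouring[v] not in lists[v] for v in lists):
        return False
    return all(colouring[u] != colouring[v] for u, v in edges)


def brute_force_list_colouring(edges, lists):
    """Try every combination of allowed colours (exponential; tiny graphs only).

    `lists` maps each vertex to its set of allowed colours, i.e. L(u).
    Returns a valid colouring as a dict, or None if none exists.
    """
    vertices = sorted(lists)
    for choice in product(*(sorted(lists[v]) for v in vertices)):
        colouring = dict(zip(vertices, choice))
        if is_valid_list_colouring(edges, lists, colouring):
            return colouring
    return None


# Hypothetical example: the 3-vertex path P_3 with prescribed lists.
path_edges = [(0, 1), (1, 2)]
path_lists = {0: {1}, 1: {1, 2}, 2: {1, 3}}
path_result = brute_force_list_colouring(path_edges, path_lists)

# A triangle whose lists only offer two colours admits no list colouring.
triangle_edges = [(0, 1), (1, 2), (0, 2)]
triangle_lists = {0: {1, 2}, 1: {1, 2}, 2: {1, 2}}
triangle_result = brute_force_list_colouring(triangle_edges, triangle_lists)
```

Ordinary $k$-Colouring is the special case in which every list equals $\{1, \ldots, k\}$.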

arXiv:1804.08606 [pdf, other]

Zero-Shot Visual Imitation

Authors: Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, Trevor Darrell

Abstract: The current dominant paradigm for imitation learning relies on strong supervision of expert actions to learn both 'what' and 'how' to imitate. We pursue an alternative paradigm wherein an agent first explores the world without any expert supervision and then distills its experience into a goal-conditioned skill policy with a novel forward consistency loss. In our framework, the role of the expert is only to communicate the goals (i.e., what to imitate) during inference. The learned policy is then employed to mimic the expert (i.e., how to imitate) after seeing just a sequence of images demonstrating the desired task. Our method is 'zero-shot' in the sense that the agent never has access to expert actions during training or for the task demonstration at inference. We evaluate our zero-shot imitator in two real-world settings: complex rope manipulation with a Baxter robot and navigation in previously unseen office environments with a TurtleBot. Through further experiments in VizDoom simulation, we provide evidence that better mechanisms for exploration lead to learning a more capable policy which in turn improves end task performance. Videos, models, and more details are available at https://pathak22.github.io/zeroshot-imitation/

Submitted 23 April, 2018; originally announced April 2018.

Comments: Oral presentation at ICLR 2018. Website at https://pathak22.github.io/zeroshot-imitation/

arXiv:1804.08328 [pdf, other]

Taskonomy: Disentangling Task Transfer Learning

Authors: Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

Abstract: Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying existence of a structure among visual tasks. Knowing this structure has notable values; it is the concept underlying transfer learning and provides a principled way for identifying redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up the complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g. nontrivial emerged relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed for solving a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping the performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure including a solver that users can employ to devise efficient supervision policies for their use cases.

Submitted 23 April, 2018; originally announced April 2018.

Comments: CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision/

arXiv:1804.02811 [pdf, other]

Connecting Dots -- from Local Covariance to Empirical Intrinsic Geometry and Locally Linear Embedding

Authors: John Malik, Chao Shen, Hau-Tieng Wu, Nan Wu

Abstract: Local covariance structure under the manifold setup has been widely applied in the machine learning society. Based on the established theoretical results, we provide an extensive study of two relevant manifold learning algorithms, empirical intrinsic geometry (EIG) and the locally linear embedding (LLE) under the manifold setup. Particularly, we show that without an accurate dimension estimation, the geodesic distance estimation by EIG might be corrupted. Furthermore, we show that by taking the local covariance matrix into account, we can more accurately estimate the local geodesic distance. When understanding LLE based on the local covariance structure, its intimate relationship with the curvature suggests a variation of LLE depending on the "truncation scheme". We provide a theoretical analysis of the variation.

Submitted 8 February, 2019; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: 25 pages, 4 figures

MSC Class: 62-07

arXiv:1803.07549 [pdf, other]

Learning Category-Specific Mesh Reconstruction from Image Collections

Authors: Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, Jitendra Malik

Abstract: We present a learning framework for recovering the 3D shape, camera, and texture of an object from a single image. The shape is represented as a deformable 3D mesh model of an object category where a shape is parameterized by a learned mean shape and per-instance predicted deformation. Our approach allows leveraging an annotated image collection for training, where the deformable model and the 3D prediction mechanism are learned without relying on ground-truth 3D or multi-view supervision. Our representation enables us to go beyond existing 3D prediction approaches by incorporating texture inference as prediction of an image in a canonical appearance space. Additionally, we show that semantic keypoints can be easily associated with the predicted shapes. We present qualitative and quantitative results of our approach on CUB and PASCAL3D datasets and show that we can learn to predict diverse shapes and textures across objects using only annotated image collections. The project website can be found at https://akanazawa.github.io/cmr/.

Submitted 30 July, 2018; v1 submitted 20 March, 2018; originally announced March 2018.

Comments: Project URL: https://akanazawa.github.io/cmr/

arXiv:1803.01710 [pdf, other]

Diffuse to fuse EEG spectra -- intrinsic geometry of sleep dynamics for classification

Authors: Gi-Ren Liu, Yu-Lun Lo, John Malik, Yuan-Chung Sheu, Hau-tieng Wu

Abstract: We propose a novel algorithm for sleep dynamics visualization and automatic annotation by applying a diffusion geometry based sensor fusion algorithm to fuse spectral information from two electroencephalograms (EEG). The diffusion geometry approach helps organize the nonlinear dynamical structure hidden in the EEG signal. The visualization is achieved by the nonlinear dimension reduction capability of the chosen diffusion geometry algorithms. For the automatic annotation purpose, the support vector machine is trained to predict the sleep stage. The prediction performance is validated on a publicly available benchmark database, Physionet Sleep-EDF [extended] SC$^*$ (SC = Sleep Cassette) and ST$^*$ (ST = Sleep Telemetry), with the leave-one-subject-out cross validation. When we have a single EEG channel (Fpz-Cz), the overall accuracy, macro F1 and Cohen's kappa achieve $82.72\%$, $75.91\%$ and $76.1\%$ respectively in Sleep-EDF SC$^*$ and $78.63\%$, $73.58\%$ and $69.48\%$ in Sleep-EDF ST$^*$. This performance is compatible with the state-of-the-art results. When we have two EEG channels (Fpz-Cz and Pz-Oz), the overall accuracy, macro F1 and Cohen's kappa achieve $84.44\%$, $78.25\%$ and $78.36\%$ respectively in Sleep-EDF SC$^*$ and $79.05\%$, $74.73\%$ and $70.31\%$ in Sleep-EDF ST$^*$. The results suggest the potential of the proposed algorithm in practical applications.

Submitted 6 May, 2019; v1 submitted 28 February, 2018; originally announced March 2018.
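
Illustrative note (not from the paper): the three scores quoted in the abstract above (overall accuracy, macro F1 and Cohen's kappa) can all be derived from a single confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example matrix are made up for illustration and do not reproduce any Sleep-EDF result.

```python
def classification_metrics(confusion):
    """Return (accuracy, macro_f1, cohen_kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples with true class i that were
    predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]  # true-class counts
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]      # predicted-class counts

    # Overall accuracy: fraction of samples on the diagonal.
    accuracy = sum(confusion[k][k] for k in range(n_classes)) / total

    # Per-class F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row_sum + col_sum).
    f1_scores = []
    for k in range(n_classes):
        denom = row_sums[k] + col_sums[k]
        f1_scores.append(2 * confusion[k][k] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n_classes       # unweighted mean over classes

    # Cohen's kappa: agreement corrected for chance agreement.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return accuracy, macro_f1, kappa


# Made-up two-class example (not sleep-stage data).
example_confusion = [[40, 10],
                     [5, 45]]
example_accuracy, example_macro_f1, example_kappa = classification_metrics(example_confusion)
```

Macro F1 weights every class equally, which is why it can sit well below overall accuracy when some classes are rare.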

arXiv:1801.03910 [pdf, other]

Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction

Authors: Shubham Tulsiani, Alexei A. Efros, Jitendra Malik

Abstract: We present a framework for learning single-view shape and pose prediction without using direct supervision for either. Our approach allows leveraging multi-view observations from unknown poses as supervisory signal during training. Our proposed training setup enforces geometric consistency between the independently predicted shape and pose from two views of the same instance. We consequently learn to predict shape in an emergent canonical (view-agnostic) frame along with a corresponding pose predictor. We show empirical and qualitative results using the ShapeNet dataset and observe encouragingly competitive performance to previous techniques which rely on stronger forms of supervision. We also demonstrate the applicability of our framework in a realistic setting which is beyond the scope of existing techniques: using a training dataset comprised of online product images where the underlying shape and pose are unknown.

Submitted 24 April, 2018; v1 submitted 11 January, 2018; originally announced January 2018.

Comments: Project url with code: https://shubhtuls.github.io/mvcSnP/

arXiv:1712.08125 [pdf, other]

Unifying Map and Landmark Based Representations for Visual Navigation

Authors: Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik

Abstract: This work presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

Submitted 21 December, 2017; originally announced December 2017.

Comments: Project page with videos: https://s-gupta.github.io/cmpl/

arXiv:1712.06584 [pdf, other]

End-to-end Recovery of Human Shape and Pose

Authors: Angjoo Kanazawa, Michael J. Black, David W. Jacobs, Jitendra Malik

Abstract: We describe Human Mesh Recovery (HMR), an end-to-end framework for reconstructing a full 3D mesh of a human body from a single RGB image. In contrast to most current methods that compute 2D or 3D joint locations, we produce a richer and more useful mesh representation that is parameterized by shape and 3D joint angles. The main objective is to minimize the reprojection loss of keypoints, which allows our model to be trained using images in-the-wild that only have ground truth 2D annotations. However, the reprojection loss alone leaves the model highly underconstrained. In this work we address this problem by introducing an adversary trained to tell whether a human body parameter is real or not using a large database of 3D human meshes. We show that HMR can be trained with and without using any paired 2D-to-3D supervision. We do not rely on intermediate 2D keypoint detections and infer 3D pose and shape parameters directly from image pixels. Our model runs in real-time given a bounding box containing the person. We demonstrate our approach on various images in-the-wild and out-perform previous optimization based methods that output 3D meshes and show competitive results on tasks such as 3D joint location estimation and part segmentation.

Submitted 23 June, 2018; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: CVPR 2018, Project page with code: https://akanazawa.github.io/hmr/

arXiv:1712.03121 [pdf, other]

Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image

Authors: Jameel Malik, Ahmed Elhayek, Didier Stricker

Abstract: Articulated hand pose estimation is a challenging task for human-computer interaction. The state-of-the-art hand pose estimation algorithms work only with one or a few subjects for which they have been calibrated or trained. Particularly, the hybrid methods based on learning followed by model fitting or model based deep learning do not explicitly consider varying hand shapes and sizes. In this work, we introduce a novel hybrid algorithm for estimating the 3D hand pose as well as bone-lengths of the hand skeleton at the same time, from a single depth image. The proposed CNN architecture learns hand pose parameters and scale parameters associated with the bone-lengths simultaneously. Subsequently, a new hybrid forward kinematics layer employs both parameters to estimate 3D joint positions of the hand. For end-to-end training, we combine three public datasets NYU, ICVL and MSRA-2015 in one unified format to achieve large variation in hand shapes and sizes. Among hybrid methods, our method shows improved accuracy over the state-of-the-art on the combined dataset and the ICVL dataset that contain multiple subjects. Also, our algorithm is demonstrated to work well with unseen images.

Submitted 8 December, 2017; originally announced December 2017.

Comments: This paper was accepted and presented at the 3DV-2017 conference held in Qingdao, China. http://irc.cs.sdu.edu.cn/3dv/

arXiv:1712.02310 [pdf, other]

From Lifestyle Vlogs to Everyday Interactions

Authors: David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros, Jitendra Malik

Abstract: A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start with a large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data. We show that by collecting the data first, we are able to achieve greater scale and far greater diversity in terms of actions and actors. Additionally, our data exposes biases built into common explicitly gathered data. We make sense of our data by analyzing the central component of interaction -- hands. We benchmark two tasks: identifying semantic object contact at the video level and non-semantic contact state at the frame level. We additionally demonstrate future prediction of hands.

Submitted 6 December, 2017; originally announced December 2017.

Comments: Project page at: http://people.eecs.berkeley.edu/~dfouhey/2017/VLOG/

arXiv:1712.01812 [pdf, other]

Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene

Authors: Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A. Efros, Jitendra Malik

Abstract: The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

Submitted 24 April, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

Comments: Project url with code: https://shubhtuls.github.io/factored3d

arXiv:1710.08247 [pdf, other]

doi 10.1007/978-3-319-46487-9_33

Generic 3D Representation via Pose Estimation and Matching

Authors: Amir R. Zamir, Tilman Wekel, Pulkit Argrawal, Colin Weil, Jitendra Malik, Silvio Savarese

Abstract: Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose estimation). In the context of the core supervised tasks, we demonstrate our representation achieves state-of-the-art wide baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learned features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy of both supervised tasks comes comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion on the learned representation and open research questions.

Submitted 23 October, 2017; originally announced October 2017.

Comments: Published in ECCV16. See the project website http://3drepresentation.stanford.edu/ and dataset website https://github.com/amir32002/3D_Street_View

Journal ref: ECCV 2016 535-553

arXiv:1710.06104 [pdf, other]

Large-Scale 3D Shape Reconstruction and Segmentation from ShapeNet Core55

Authors: Li Yi, Lin Shao, Manolis Savva, Haibin Huang, Yang Zhou, Qirui Wang, Benjamin Graham, Martin Engelcke, Roman Klokov, Victor Lempitsky, Yuan Gan, Pengyu Wang, Kun Liu, Fenggen Yu, Panpan Shui, Bingyang Hu, Yan Zhang, Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Minki Jeong, Jaehoon Choi, Changick Kim, Angom Geetchandra , et al. (25 additional authors not shown)

Abstract: We introduce a large-scale 3D shape understanding benchmark using data and annotation from ShapeNet 3D object database. The benchmark consists of two tasks: part-level segmentation of 3D shapes and 3D reconstruction from single view images. Ten teams have participated in the challenge and the best performing teams have outperformed state-of-the-art approaches on both tasks. A few novel deep learning architectures have been proposed on various 3D representations on both tasks. We report the techniques used by each team and the corresponding performances. In addition, we summarize the major discoveries from the reported results and possible trends for the future work in the field.

Submitted 27 October, 2017; v1 submitted 17 October, 2017; originally announced October 2017.

arXiv:1709.00832 [pdf]

Extended Phase Graph formalism for systems with Magnetization Transfer and Chemical Exchange

Authors: Shaihan J. Malik, Rui P. A. G. Teixeira, Joseph V. Hajnal

Abstract: An Extended Phase Graph framework for modelling systems with exchange or magnetization transfer (MT) is proposed. The framework, referred to as EPG-X, models coupled two-compartment systems by describing each compartment with separate phase graphs that exchange during evolution periods. There are two variants: EPG-X(BM) for systems governed by the Bloch-McConnell equations; and EPG-X(MT) for the pulsed MT formalism. For the MT case the "bound" protons have no transverse components so their phase graph consists only of longitudinal states. EPG-X was used to model steady-state gradient echo imaging, MT effects in multislice Turbo Spin Echo imaging, multiecho CPMG for multicomponent T2 relaxometry and transient variable flip angle gradient echo imaging of the type used for MR Fingerprinting. Experimental data were also collected for the final case. Steady-state predictions from EPG-X closely match directly derived steady-state solutions which differ substantially from classic "single pool" EPG predictions. EPG-X(MT) predicts similar MT related levels of signal attenuation in white matter as have been reported elsewhere in the literature. Modelling of CPMG echo trains with EPG-X(BM) suggests that exchange processes can lead to an underestimate of the fraction of short T2 species. Modelling of transient gradient echo sequences with EPG-X(MT) suggests that measurable MT effects result from variable saturation of bound protons, particularly after inversion pulses.
In conclusion, EPG-X can be used for modelling of the transient signal response of systems exhibiting chemical exchange or MT. This may be particularly beneficial for relaxometry approaches that rely on characterising transient rather than steady-state sequences.

Submitted 4 September, 2017; originally announced September 2017.

Comments: For associated code see https://github.com/mriphysics/EPG-X

arXiv:1708.05375 [pdf, other]

Learning a Multi-View Stereo Machine

Authors: Abhishek Kar, Christian Häne, Jitendra Malik

Abstract: We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches as well as recent learning based methods.

Submitted 17 August, 2017; originally announced August 2017.

arXiv:1705.08421 [pdf, other]

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Authors: Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Abstract: This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

Submitted 30 April, 2018; v1 submitted 23 May, 2017; originally announced May 2017.

Comments: To appear in CVPR 2018. Check dataset page https://research.google.com/ava/ for details

arXiv:1704.06254 [pdf, other]

Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency

Authors: Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, Jitendra Malik

Abstract: We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations e.g. foreground masks, depth, color images, semantics etc. as supervision for learning single-view 3D prediction. We present empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset.

Submitted 20 April, 2017; originally announced April 2017.

Comments: To appear at CVPR 2017. Project webpage : https://shubhtuls.github.io/drc/

arXiv:1704.00710 [pdf, other]

Hierarchical Surface Prediction for 3D Object Reconstruction

Authors: Christian Häne, Shubham Tulsiani, Jitendra Malik

Abstract: Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as a single color image. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. Our approach is not dependent on a specific input type. We show results for geometry prediction from color images, depth images and shape completion from partial voxel grids. Our analysis shows that our high resolution predictions are more accurate than low resolution predictions.

Submitted 6 November, 2017; v1 submitted 3 April, 2017; originally announced April 2017.

Comments: 3DV 2017

arXiv:1703.02018 [pdf, other]

Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation

Authors: Ashvin Nair, Dian Chen, Pulkit Agrawal, Phillip Isola, Pieter Abbeel, Jitendra Malik, Sergey Levine

Abstract: Manipulation of deformable objects, such as ropes and cloth, is an important but challenging problem in robotics. We present a learning-based system where a robot takes as input a sequence of images of a human manipulating a rope from an initial to goal configuration, and outputs a sequence of actions that can reproduce the human demonstration, using only monocular images as input. To perform this task, the robot learns a pixel-level inverse dynamics model of rope manipulation directly from images in a self-supervised manner, using about 60K interactions with the rope collected autonomously by the robot. The human demonstration provides a high-level plan of what to do and the low-level inverse model is used to execute the plan. We show that by combining the high and low-level plans, the robot can successfully manipulate a rope into a variety of target shapes using only a sequence of human-provided images for direction.

Submitted 6 March, 2017; originally announced March 2017.

Comments: 8 pages, accepted to International Conference on Robotics and Automation (ICRA) 2017

arXiv:1703.00441 [pdf, other]

Learning to Optimize Neural Nets

Authors: Ke Li, Jitendra Malik

Abstract: Learning to Optimize is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

Submitted 30 November, 2017; v1 submitted 1 March, 2017; originally announced March 2017.

Comments: 10 pages, 15 figures

arXiv:1703.00440 [pdf, other]

Fast k-Nearest Neighbour Search via Prioritized DCI

Authors: Ke Li, Jitendra Malik

Abstract: Most exact methods for k-nearest neighbour search suffer from the curse of dimensionality; that is, their query times exhibit exponential dependence on either the ambient or the intrinsic dimensionality. Dynamic Continuous Indexing (DCI) offers a promising way of circumventing the curse and successfully reduces the dependence of query time on intrinsic dimensionality from exponential to sublinear. In this paper, we propose a variant of DCI, which we call Prioritized DCI, and show a remarkable improvement in the dependence of query time on intrinsic dimensionality. In particular, a linear increase in intrinsic dimensionality, or equivalently, an exponential increase in the number of points near a query, can be mostly counteracted with just a linear increase in space. We also demonstrate empirically that Prioritized DCI significantly outperforms prior methods. In particular, relative to Locality-Sensitive Hashing (LSH), Prioritized DCI reduces the number of distance evaluations by a factor of 14 to 116 and the memory consumption by a factor of 21.

Submitted 20 July, 2017; v1 submitted 1 March, 2017; originally announced March 2017.

Comments: 14 pages, 6 figures; International Conference on Machine Learning (ICML), 2017

arXiv:1702.08638 [pdf, other]

doi 10.1088/1361-6579/aa707c

Single-lead f-wave extraction using diffusion geometry

Authors: John Malik, Neil Reed, Chun-Li Wang, Hautieng Wu

Abstract: A novel single-lead f-wave extraction algorithm based on the modern diffusion geometry data analysis framework is proposed. The algorithm is essentially an averaged beat subtraction algorithm, where the ventricular activity template is estimated by combining a newly designed metric, the "diffusion distance," and the non-local Euclidean median based on the non-linear manifold setup. We coined the algorithm DD-NLEM. Two simulation schemes are considered, and the new algorithm DD-NLEM outperforms traditional algorithms, including the average beat subtraction, principal component analysis, and adaptive singular value cancellation, in different evaluation metrics with statistical significance. The clinical potential is shown in the real Holter signal, and we introduce a new score to evaluate the performance of the algorithm.

Submitted 28 April, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

Comments: 31 pages, 8 figures

arXiv:1702.03920 [pdf, other]

Cognitive Mapping and Planning for Visual Navigation

Authors: Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

Abstract: We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real world buildings. Our experiments demonstrate that CMP outperforms alternate learning-based architectures, as well as, classical mapping and path planning approaches in many cases. Furthermore, it naturally extends to semantically specified goals, such as 'going to a chair'. We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.

Submitted 7 February, 2019; v1 submitted 13 February, 2017; originally announced February 2017.

Comments: Extended IJCV Version of the original paper at CVPR17. Project website with code, models, simulation environment and videos: https://sites.google.com/view/cognitive-mapping-and-planning/

arXiv:1612.09508 [pdf, other]

Feedback Networks

Authors: Amir R. Zamir, Te-Lin Wu, Lin Sun, William Shen, Jitendra Malik, Silvio Savarese

Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on a feedback received from previous iteration's output. We establish that a feedback based approach has several fundamental advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback networks develop a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We put forth a general feedback based learning architecture with the endpoint results on par or better than existing feedforward networks with the addition of the above advantages. We also investigate several mechanisms in feedback architectures (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives in quest for more natural and practical learning models.

Submitted 20 August, 2017; v1 submitted 30 December, 2016; originally announced December 2016.

Comments: See a video describing the method at https://youtu.be/MY5Uhv38Ttg and the website at http://feedbacknet.stanford.edu/

arXiv:1612.06851 [pdf, other]

Beyond Skip Connections: Top-Down Modulation for Object Detection

Authors: Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, Abhinav Gupta

Abstract: In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from low-level requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO testdev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 for InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).

Submitted 19 September, 2017; v1 submitted 20 December, 2016; originally announced December 2016.

arXiv:1612.00404 [pdf, other]

Learning Shape Abstractions by Assembling Volumetric Primitives

Authors: Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, Jitendra Malik

Abstract: We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives. In addition to generating simple and geometrically interpretable explanations of 3D objects, our framework also allows us to automatically discover and exploit consistent structure in the data. We demonstrate that using our method allows predicting shape representations which can be leveraged for obtaining a consistent parsing across the instances of a shape collection and constructing an interpretable shape similarity measure. We also examine applications for image-based prediction as well as shape manipulation.

Submitted 2 August, 2018; v1 submitted 1 December, 2016; originally announced December 2016.

Comments: Project url: https://shubhtuls.github.io/volumetricPrimitives/

arXiv:1606.07419 [pdf, other]

Learning to Poke by Poking: Experiential Learning of Intuitive Physics

Authors: Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, Sergey Levine

Abstract: We investigate an experiential learning paradigm for acquiring an internal model of intuitive physics. Our model is evaluated on a real-world robotic manipulation task that requires displacing objects to target locations by poking. The robot gathered over 400 hours of experience by executing more than 100K pokes on different objects. We propose a novel approach based on deep neural networks for modeling the dynamics of robot's interactions directly from images, by jointly estimating forward and inverse models of dynamics. The inverse model objective provides supervision to construct informative visual features, which the forward model can then predict and in turn regularize the feature space for the inverse model. The interplay between these two objectives creates useful, accurate models that can then be used for multi-step decision making. This formulation has the additional benefit that it is possible to learn forward models in an abstract feature space and thus alleviate the need of predicting pixels. Our experiments show that this joint modeling approach outperforms alternative methods.

Submitted 15 February, 2017; v1 submitted 23 June, 2016; originally announced June 2016.

Journal ref: NIPS 2016

arXiv:1606.01885 [pdf, other]

Learning to Optimize

Authors: Ke Li, Jitendra Malik

Abstract: Algorithm design is a laborious process and often requires many iterations of ideation and validation. In this paper, we explore automating algorithm design and present a method to learn an optimization algorithm, which we believe to be the first method that can automatically discover a better algorithm. We approach this problem from a reinforcement learning perspective and represent any particular optimization algorithm as a policy. We learn an optimization algorithm using guided policy search and demonstrate that the resulting algorithm outperforms existing hand-engineered algorithms in terms of convergence speed and/or the final objective value.

Submitted 6 June, 2016; originally announced June 2016.

Comments: 9 pages, 3 figures

arXiv:1605.03557 [pdf, other]

View Synthesis by Appearance Flow

Authors: Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, Alexei A. Efros

Abstract: We address the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints. We approach this as a learning task but, critically, instead of learning to synthesize pixels from scratch, we learn to copy them from the input image. Our approach exploits the observation that the visual appearance of different views of the same instance is highly correlated, and such correlation could be explicitly learned by training a convolutional neural network (CNN) to predict appearance flows -- 2-D coordinate vectors specifying which pixels in the input view could be used to reconstruct the target view. Furthermore, the proposed framework easily generalizes to multiple input views by learning how to optimally combine single-view predictions. We show that for both objects and scenes, our approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.

Submitted 11 February, 2017; v1 submitted 11 May, 2016; originally announced May 2016.

arXiv:1604.08202 [pdf, other]

Amodal Instance Segmentation

Authors: Ke Li, Jitendra Malik

Abstract: We consider the problem of amodal instance segmentation, the objective of which is to predict the region encompassing both visible and occluded parts of each object. Thus far, the lack of publicly available amodal segmentation annotations has stymied the development of amodal segmentation methods. In this paper, we sidestep this issue by relying solely on standard modal instance segmentation annotations to train our model. The result is a new method for amodal instance segmentation, which represents the first such method to the best of our knowledge. We demonstrate the proposed method's effectiveness both qualitatively and quantitatively.

Submitted 17 August, 2016; v1 submitted 27 April, 2016; originally announced April 2016.

Comments: 23 pages, 14 figures; European Conference on Computer Vision (ECCV), 2016

Showing 151–200 of 233 results for author: Malik, J