Search | arXiv e-print repository

Multi-Object Hallucination in Vision-Language Models

Authors: Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David F. Fouhey, Joyce Chai

Abstract: Large vision language models (LVLMs) often suffer from object hallucination, producing objects not present in the given images. While current benchmarks for object hallucination primarily concentrate on the presence of a single object class rather than individual entities, this work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. With comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we found that (1) LVLMs suffer more hallucinations when focusing on multiple objects compared to a single object. (2) The tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations. (3) Hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.

Submitted 8 July, 2024; originally announced July 2024.

Comments: Accepted to ALVR @ ACL 2024 | Project page: https://multi-object-hallucination.github.io/

arXiv:2406.18158

3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

Authors: Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal

Abstract: Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. We also show promising results on a real robot platform with minimal finetuning. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research. Project site: https://jasonqsy.github.io/3DMVP

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.05132

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

Abstract: The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io

Submitted 12 June, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

Comments: Project website: https://3d-grand.github.io

arXiv:2403.08768

3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface

Authors: Linyi Jin, Nilesh Kulkarni, David Fouhey

Abstract: This paper introduces 3DFIRES, a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view, 3DFIRES reconstructs the complete geometry of unseen scenes, including hidden surfaces. With multiple view inputs, our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level, enabling the production of coherent and comprehensive 3D reconstruction. We train our system on non-watertight scans from a large-scale real scene dataset. We show it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction.

Submitted 13 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024. Project Page https://**linyi.github.io/3DFIRES/

arXiv:2403.03221

FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation

Authors: Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, David F. Fouhey

Abstract: Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely, methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale, but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust, while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations, and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators, showing state-of-the-art performance in 6DoF pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free Relocalization.

Submitted 5 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024. Project Page: https://crockwell.github.io/far/

arXiv:2312.05251

Reconstructing Hands in 3D with Transformers

Authors: Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, Jitendra Malik

Abstract: We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data, we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model, we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings, we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations, HInt, we demonstrate significant improvements over existing baselines. We make our code, data and models available on the project website: https://geopavlakos.github.io/hamer/.

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2309.12311

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Authors: Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai

Abstract: 3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website: https://chat-with-nerf.github.io/

Submitted 21 September, 2023; originally announced September 2023.

Comments: Project website: https://chat-with-nerf.github.io/

arXiv:2307.07511

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

Authors: Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas

Abstract: We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.

Submitted 14 July, 2023; originally announced July 2023.

Comments: Project Page with additional results available at https://nileshkulkarni.github.io/nifty

arXiv:2306.08731

EPIC Fields: Marrying 3D Geometry and Video Understanding

Authors: Vadim Tschernezki, Ahmad Darkhalil, Zhifan Zhu, David Fouhey, Iro Laina, Diane Larlus, Dima Damen, Andrea Vedaldi

Abstract: Neural rendering is fuelling a unification of learning, 3D geometry and video understanding that has been waiting for more than two decades. Progress, however, is still hampered by a lack of suitable datasets and benchmarks. To address this gap, we introduce EPIC Fields, an augmentation of EPIC-KITCHENS with 3D camera information. Like other datasets for neural rendering, EPIC Fields removes the complex and expensive step of reconstructing cameras using photogrammetry, and allows researchers to focus on modelling problems. We illustrate the challenge of photogrammetry in egocentric videos of dynamic actions and propose innovations to address them. Compared to other neural rendering datasets, EPIC Fields is better tailored to video understanding because it is paired with labelled action segments and the recent VISOR segment annotations. To further motivate the community, we also evaluate two benchmark tasks in neural rendering and segmenting dynamic objects, with strong baselines that showcase what is not possible today. We also highlight the advantage of geometry in semi-supervised video object segmentations on the VISOR annotations. EPIC Fields reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.

Submitted 1 February, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: Published at NeurIPS 2023. 24 pages, 15 figures. Project Webpage: http://epic-kitchens.github.io/epic-fields

arXiv:2306.08671

Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data

Authors: Nilesh Kulkarni, Linyi Jin, Justin Johnson, David F. Fouhey

Abstract: We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data.

Submitted 14 June, 2023; originally announced June 2023.

Comments: Project page: https://nileshkulkarni.github.io/d2drdf/

arXiv:2305.09664

Understanding 3D Object Interaction from a Single Image

Authors: Shengyi Qian, David F. Fouhey

Abstract: Humans can easily understand a single image as depicting multiple potential objects permitting interaction. We use this skill to plan our interactions with the world and accelerate understanding new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore the 3D scene or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties and affordance of objects. To power this model, we collect a dataset with Internet videos, egocentric videos and indoor images to train and validate our approach. Our model yields strong performance on our data, and generalizes well to robotics data. Project site: https://jasonqsy.github.io/3DOI/

Submitted 4 August, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: ICCV 2023

arXiv:2212.03239

Perspective Fields for Single Image Camera Calibration

Authors: Linyi Jin, Jianming Zhang, Yannick Hold-Geoffroy, Oliver Wang, Kevin Matzen, Matthew Sticha, David F. Fouhey

Abstract: Geometric camera calibration is often required for applications that understand the perspective of the image. We propose perspective fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages as it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields and the predicted Perspective Fields can be converted to calibration parameters easily. We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing.

Submitted 16 March, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: CVPR 2023 Camera Ready. Project Page https://**linyi.github.io/PerspectiveFields/

arXiv:2209.15036

doi 10.3847/1538-4365/aca539

Large-Scale Spatial Cross-Calibration of Hinode/SOT-SP and SDO/HMI

Authors: David F. Fouhey, Richard E. L. Higgins, Spiro K. Antiochos, Graham Barnes, Marc L. DeRosa, J. Todd Hoeksema, K. D. Leka, Yang Liu, Peter W. Schuck, Tamas I. Gombosi

Abstract: We investigate the cross-calibration of the Hinode/SOT-SP and SDO/HMI instrument meta-data, specifically the correspondence of the scaling and pointing information. Accurate calibration of these datasets gives the correspondence needed by inter-instrument studies and learning-based magnetogram systems, and is required for physically-meaningful photospheric magnetic field vectors. We approach the problem by robustly fitting geometric models on correspondences between images from each instrument's pipeline. This technique is common in computer vision, but several critical details are required when using scanning slit spectrograph data like Hinode/SOT-SP. We apply this technique to data spanning a decade of the Hinode mission. Our results suggest corrections to the published Level 2 Hinode/SOT-SP data. First, an analysis on approximately 2,700 scans suggests that the reported pixel size in Hinode/SOT-SP Level 2 data is incorrect by around 1%. Second, analysis of over 12,000 scans shows that the pointing information is often incorrect by dozens of arcseconds with a strong bias. Regression of these corrections indicates that thermal effects have caused secular and cyclic drift in Hinode/SOT-SP pointing data over its mission. We offer two solutions. First, direct co-alignment with SDO/HMI data via our procedure can improve alignments for many Hinode/SOT-SP scans. Second, since the pointing errors are predictable, simple post-hoc corrections can substantially improve the pointing. We conclude by illustrating the impact of this updated calibration on derived physical data products needed for research and interpretation. Among other things, our results suggest that the pointing errors induce a hemispheric bias in estimates of radial current density.

Submitted 29 September, 2022; originally announced September 2022.

Comments: Under revisions at ApJS

arXiv:2209.13064

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

Authors: Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen

Abstract: We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR

Submitted 26 September, 2022; originally announced September 2022.

Comments: 10 pages main, 38 pages appendix. Accepted at NeurIPS 2022 Track on Datasets and Benchmarks. Data, code and leaderboards from: http://epic-kitchens.github.io/VISOR

arXiv:2208.08988

The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs

Authors: Chris Rockwell, Justin Johnson, David F. Fouhey

Abstract: We present a simple baseline for directly estimating the relative pose (rotation and translation, including scale) between two images. Deep methods have recently shown strong progress but often require complex or multi-stage architectures. We show that a handful of modifications can be applied to a Vision Transformer (ViT) to bring its computations close to the Eight-Point Algorithm. This inductive bias enables a simple method to be competitive in multiple settings, often substantially improving over the state of the art with strong performance gains in limited data regimes.

Submitted 23 January, 2023; v1 submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted to 3DV 2022; Project Page: https://crockwell.github.io/rel_pose/; Revision: Fixed Epipolar Lines in Figure 3, Figure 10

arXiv:2208.04307

PlaneFormers: From Sparse View Planes to 3D Reconstruction

Authors: Samir Agarwala, Linyi Jin, Chris Rockwell, David F. Fouhey

Abstract: We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success.

Submitted 8 August, 2022; originally announced August 2022.

Comments: Accepted to ECCV 2022

arXiv:2204.12489

Sound Localization by Self-Supervised Time Delay Estimation

Authors: Ziyang Chen, David F. Fouhey, Andrew Owens

Abstract: Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation from unlabeled stereo sounds, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face. Project site: https://ificl.github.io/stereocrw/

Submitted 28 January, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: ECCV 2022

arXiv:2203.16531

Understanding 3D Object Articulation in Internet Videos

Authors: Shengyi Qian, Linyi Jin, Chris Rockwell, Siyi Chen, David F. Fouhey

Abstract: We propose to investigate detecting and characterizing the 3D planar articulation of objects from ordinary videos. While seemingly easy for humans, this problem poses many challenges for computers. We propose to approach this problem by combining a top-down detection system that finds planes that can be articulated along with an optimization approach that solves for a 3D plane that can explain a sequence of observed articulations. We show that this system can be trained on a combination of videos and 3D scan datasets. When tested on a dataset of challenging Internet videos and the Charades dataset, our approach obtains strong performance. Project site: https://jasonqsy.github.io/Articulation3D

Submitted 30 March, 2022; originally announced March 2022.

Comments: CVPR 2022

arXiv:2112.04481 [pdf, other]

What's Behind the Couch? Directed Ray Distance Functions (DRDF) for 3D Scene Reconstruction

Authors: Nilesh Kulkarni, Justin Johnson, David F. Fouhey

Abstract: We present an approach for full 3D scene reconstruction from a single unseen image. We train on a dataset of realistic non-watertight scans of scenes. Our approach predicts a distance function, since these have shown promise in handling complex topologies and large spaces. We identify and analyze two key challenges for predicting such image conditioned distance functions that have prevented their success on real 3D scene data. First, we show that predicting a conventional scene distance from an image requires reasoning over a large receptive field. Second, we analytically show that the optimal output of the network trained to predict these distance functions does not obey all the distance function properties. We propose an alternate distance function, the Directed Ray Distance Function (DRDF), that tackles both challenges. We show that a deep network trained to predict DRDFs outperforms all other methods quantitatively and qualitatively on 3D reconstruction from a single image on Matterport3D, 3DFront, and ScanNet.

Submitted 4 April, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Updated illustrations for method section. Project Page see https://nileshkulkarni.github.io/scene_drdf

arXiv:2112.01520 [pdf, other]

Recognizing Scenes from Novel Viewpoints

Authors: Shengyi Qian, Alexander Kirillov, Nikhila Ravi, Devendra Singh Chaplot, Justin Johnson, David F. Fouhey, Georgia Gkioxari

Abstract: Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects. In this work, we attempt to endow machines with this ability. We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories. All this without access to the RGB images from those views. We pair 2D scene recognition with an implicit 3D representation and learn from multi-view 2D annotations of hundreds of scenes without any 3D supervision beyond camera poses. We experiment on challenging datasets and demonstrate our model's ability to jointly capture semantics and geometry of novel scenes with diverse layouts, object types and shapes.

Submitted 2 December, 2021; originally announced December 2021.

arXiv:2108.12530 [pdf]

Combining chest X-rays and electronic health record (EHR) data using machine learning to diagnose acute respiratory failure

Authors: Sarah Jabbour, David Fouhey, Ella Kazerooni, Jenna Wiens, Michael W Sjoding

Abstract: Objective: When patients develop acute respiratory failure, accurately identifying the underlying etiology is essential for determining the best treatment. However, differentiating between common medical diagnoses can be challenging in clinical practice. Machine learning models could improve medical diagnosis by aiding in the diagnostic evaluation of these patients. Materials and Methods: Machine learning models were trained to predict the common causes of acute respiratory failure (pneumonia, heart failure, and/or COPD). Models were trained using chest radiographs and clinical data from the electronic health record (EHR) and applied to an internal and external cohort. Results: The internal cohort of 1,618 patients included 508 (31%) with pneumonia, 363 (22%) with heart failure, and 137 (8%) with COPD based on physician chart review. A model combining chest radiographs and EHR data outperformed models based on each modality alone. Models had similar or better performance compared to a randomly selected physician reviewer. For pneumonia, the combined model area under the receiver operating characteristic curve (AUROC) was 0.79 (0.77-0.79), image model AUROC was 0.74 (0.72-0.75), and EHR model AUROC was 0.74 (0.70-0.76). For heart failure, combined: 0.83 (0.77-0.84), image: 0.80 (0.71-0.81), and EHR: 0.79 (0.75-0.82). For COPD, combined: AUROC = 0.88 (0.83-0.91), image: 0.83 (0.77-0.89), and EHR: 0.80 (0.76-0.84). In the external cohort, performance was consistent for heart failure and increased for COPD, but declined slightly for pneumonia.
Conclusions: Machine learning models combining chest radiographs and EHR data can accurately differentiate between common causes of acute respiratory failure. Further work is needed to determine how these models could act as a diagnostic aid to clinicians in clinical settings.

Submitted 20 April, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

arXiv:2108.12421 [pdf, other]

doi 10.3847/1538-4365/ac42d5

SynthIA: A Synthetic Inversion Approximation for the Stokes Vector Fusing SDO and Hinode into a Virtual Observatory

Authors: Richard E. L. Higgins, David F. Fouhey, Spiro K. Antiochos, Graham Barnes, Mark C. M. Cheung, J. Todd Hoeksema, KD Leka, Yang Liu, Peter W. Schuck, Tamas I. Gombosi

Abstract: Both NASA's Solar Dynamics Observatory (SDO) and the JAXA/NASA Hinode mission include spectropolarimetric instruments designed to measure the photospheric magnetic field. SDO's Helioseismic and Magnetic Imager (HMI) emphasizes full-disk high-cadence and good spatial resolution data acquisition while Hinode's Solar Optical Telescope Spectro-Polarimeter (SOT-SP) focuses on high spatial resolution and spectral sampling at the cost of a limited field of view and slower temporal cadence. This work introduces a deep-learning system named SynthIA (Synthetic Inversion Approximation), that can enhance both missions by capturing the best of each instrument's characteristics. We use SynthIA to produce a new magnetogram data product, SynodeP (Synthetic Hinode Pipeline), that mimics magnetograms from the higher spectral resolution Hinode/SOT-SP pipeline, but is derived from full-disk, high-cadence, and lower spectral-resolution SDO/HMI Stokes observations. Results on held-out data show that SynodeP has good agreement with the Hinode/SOT-SP pipeline inversions, including magnetic fill fraction, which is not provided by the current SDO/HMI pipeline. SynodeP further shows a reduction in the magnitude of the 24-hour oscillations present in the SDO/HMI data. To demonstrate SynthIA's generality, we show the use of SDO/AIA data and subsets of the HMI data as inputs, which enables trade-offs between fidelity to the Hinode/SOT-SP inversions, number of observations used, and temporal artifacts.
We discuss possible generalizations of SynthIA and its implications for space weather modeling. This work is part of the NASA Heliophysics DRIVE Science Center (SOLSTICE) at the University of Michigan under grant NASA 80NSSC20K0600E, and will be open-sourced.

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2108.05892 [pdf, other]

PixelSynth: Generating a 3D-Consistent Experience from a Single Image

Authors: Chris Rockwell, David F. Fouhey, Justin Johnson

Abstract: Recent advancements in differentiable rendering and 3D reasoning have driven exciting results in novel view synthesis from a single image. Despite realistic results, methods are limited to relatively small view change. In order to synthesize immersive scenes, models must also be able to extrapolate. We present an approach that fuses 3D reasoning with autoregressive modeling to outpaint large view changes in a 3D-consistent manner, enabling scene synthesis. We demonstrate considerable improvement in single image large-angle view synthesis results compared to a variety of methods and possible variants across simulated and real datasets. In addition, we show increased 3D consistency compared to alternative accumulation methods. Project website: https://crockwell.github.io/pixelsynth/

Submitted 12 August, 2021; originally announced August 2021.

Comments: In ICCV 2021

arXiv:2105.01061 [pdf, other]

Collision Replay: What Does Bumping Into Things Tell You About Scene Geometry?

Authors: Alexander Raistrick, Nilesh Kulkarni, David F. Fouhey

Abstract: What does bumping into things in a scene tell you about scene geometry? In this paper, we investigate the idea of learning from collisions. At the heart of our approach is the idea of collision replay, where we use examples of a collision to provide supervision for observations at a past frame. We use collision replay to train convolutional neural networks to predict a distribution over collision time from new images. This distribution conveys information about the navigational affordances (e.g., corridors vs open spaces) and, as we show, can be converted into the distance function for the scene geometry. We analyze this approach with an agent that has noisy actuation in a photorealistic simulator.

Submitted 3 May, 2021; originally announced May 2021.

arXiv:2103.17273 [pdf, other]

Fast and Accurate Emulation of the SDO/HMI Stokes Inversion with Uncertainty Quantification

Authors: Richard E. L. Higgins, David F. Fouhey, Dichang Zhang, Spiro K. Antiochos, Graham Barnes, J. Todd Hoeksema, K. D. Leka, Yang Liu, Peter W. Schuck, Tamas I. Gombosi

Abstract: The Helioseismic and Magnetic Imager (HMI) onboard NASA's Solar Dynamics Observatory (SDO) produces estimates of the photospheric magnetic field which are a critical input to many space weather modelling and forecasting systems. The magnetogram products produced by HMI and its analysis pipeline are the result of a per-pixel optimization that estimates solar atmospheric parameters and minimizes disagreement between a synthesized and observed Stokes vector. In this paper, we introduce a deep learning-based approach that can emulate the existing HMI pipeline results two orders of magnitude faster than the current pipeline algorithms. Our system is a U-Net trained on input Stokes vectors and their accompanying optimization-based VFISV inversions. We demonstrate that our system, once trained, can produce high-fidelity estimates of the magnetic field and kinematic and thermodynamic parameters while also producing meaningful confidence intervals. We additionally show that despite penalizing only per-pixel loss terms, our system is able to faithfully reproduce known systematic oscillations in full-disk statistics produced by the pipeline. This emulation system could serve as an initialization for the full Stokes inversion or as an ultra-fast proxy inversion. This work is part of the NASA Heliophysics DRIVE Science Center (SOLSTICE) at the University of Michigan, under grant NASA 80NSSC20K0600E, and has been open sourced.

Submitted 27 August, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

arXiv:2103.14644 [pdf, other]

Planar Surface Reconstruction from Sparse Views

Authors: Linyi Jin, Shengyi Qian, Andrew Owens, David F. Fouhey

Abstract: The paper studies planar surface reconstruction of indoor scenes from two views with unknown camera poses. While prior approaches have successfully created object-centric reconstructions of many scenes, they fail to exploit other structures, such as planes, which are typically the dominant components of indoor scenes. In this paper, we reconstruct planar surfaces from multiple views, while jointly estimating camera pose. Our experiments demonstrate that our method is able to advance the state of the art of reconstruction from sparse views, on challenging scenes from Matterport3D. Project site: https://jinlinyi.github.io/SparsePlanes/

Submitted 20 August, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Accepted to ICCV 2021 (Oral Presentation)

arXiv:2009.10132 [pdf, other]

Deep Learning Applied to Chest X-Rays: Exploiting and Preventing Shortcuts

Authors: Sarah Jabbour, David Fouhey, Ella Kazerooni, Michael W. Sjoding, Jenna Wiens

Abstract: While deep learning has shown promise in improving the automated diagnosis of disease based on chest X-rays, deep networks may exhibit undesirable behavior related to shortcuts. This paper studies the case of spurious class skew in which patients with a particular attribute are spuriously more likely to have the outcome of interest. For instance, clinical protocols might lead to a dataset in which patients with pacemakers are disproportionately likely to have congestive heart failure. This skew can lead to models that take shortcuts by heavily relying on the biased attribute. We explore this problem across a number of attributes in the context of diagnosing the cause of acute hypoxemic respiratory failure. Applied to chest X-rays, we show that i) deep nets can accurately identify many patient attributes including sex (AUROC = 0.96) and age (AUROC >= 0.90), ii) they tend to exploit correlations between such attributes and the outcome label when learning to predict a diagnosis, leading to poor performance when such correlations do not hold in the test population (e.g., everyone in the test set is male), and iii) a simple transfer learning approach is surprisingly effective at preventing the shortcut and promoting good generalization performance. On the task of diagnosing congestive heart failure based on a set of chest X-rays skewed towards older patients (age >= 63), the proposed approach improves generalization over standard training from 0.66 (95% CI: 0.54-0.77) to 0.84 (95% CI: 0.73-0.92) AUROC.
While simple, the proposed approach has the potential to improve the performance of models across populations by encouraging reliance on clinically relevant manifestations of disease, i.e., those that a clinician would use to make a diagnosis.

Submitted 21 September, 2020; originally announced September 2020.

Comments: 32 pages, 9 figures, 12 tables, MLHC 2020

arXiv:2008.06046 [pdf, other]

Full-Body Awareness from Partial Observations

Authors: Chris Rockwell, David F. Fouhey

Abstract: There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation. Project website: https://crockwell.github.io/partial_humans

Submitted 13 August, 2020; originally announced August 2020.

Comments: In ECCV 2020

arXiv:2007.13727 [pdf, other]

Associative3D: Volumetric Reconstruction from Sparse Views

Authors: Shengyi Qian, Linyi Jin, David F. Fouhey

Abstract: This paper studies the problem of 3D volumetric reconstruction from two views of a scene with an unknown camera. While seemingly easy for humans, this problem poses many challenges for computers since it requires simultaneously reconstructing objects in the two views while also figuring out their relationship. We propose a new approach that estimates reconstructions, distributions over the camera/object and camera/camera transformations, as well as an inter-view object affinity matrix. This information is then jointly reasoned over to produce the most likely explanation of the scene. We train and test our approach on a dataset of indoor scenes, and rigorously evaluate the merits of our joint reasoning approach. Our experiments show that it is able to recover reasonable scenes from sparse views, while the problem is still challenging. Project site: https://jasonqsy.github.io/Associative3D

Submitted 27 July, 2020; originally announced July 2020.

Comments: ECCV 2020

arXiv:2006.06669 [pdf, other]

Understanding Human Hands in Contact at Internet Scale

Authors: Dandan Shan, Jiaqi Geng, Michelle Shu, David F. Fouhey

Abstract: Hands are the central means by which humans manipulate their world and being able to reliably extract hand state information from Internet videos of humans engaged in their hands has the potential to pave the way to systems that can learn from petabytes of video data. This paper proposes steps towards this by inferring a rich representation of hands engaged in interaction method that includes: hand location, side, contact state, and a box around the object in contact. To support this effort, we gather a large-scale dataset of hands in contact with objects consisting of 131 days of footage as well as a 100K annotated hand-contact video frame dataset. The learned model on this dataset can serve as a foundation for hand-contact understanding in videos. We quantitatively evaluate it both on its own and in service of predicting and learning from 3D meshes of human hands.

Submitted 11 June, 2020; originally announced June 2020.

Comments: To appear at CVPR 2020 (Oral). Project and dataset webpage: http://fouheylab.eecs.umich.edu/~dandans/projects/100DOH/

arXiv:2006.03586 [pdf, other]

Novel Object Viewpoint Estimation through Reconstruction Alignment

Authors: Mohamed El Banani, Jason J. Corso, David F. Fouhey

Abstract: The goal of this paper is to estimate the viewpoint for a novel object. Standard viewpoint estimation approaches generally fail on this task due to their reliance on a 3D model for alignment or large amounts of class-specific training data and their corresponding canonical pose. We overcome those limitations by learning a reconstruct and align approach. Our key insight is that although we do not have an explicit 3D model or a predefined canonical pose, we can still learn to estimate the object's shape in the viewer's frame and then use an image to provide our reference model or canonical pose. In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss; the second learns whether two instances of features are aligned. At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image. We evaluate our method on novel object viewpoint estimation by generalizing across different datasets, analyzing the impact of our different modules, and providing a qualitative analysis of the learned features to identify what representations are being learnt for alignment.

Submitted 5 June, 2020; originally announced June 2020.

Comments: To appear at CVPR 2020. Project page: https://mbanani.github.io/novelviewpoints/

arXiv:2004.00614 [pdf, other]

Articulation-aware Canonical Surface Mapping

Authors: Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani

Abstract: We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animal object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.

Submitted 26 May, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

Comments: To appear at CVPR 2020, project page https://nileshkulkarni.github.io/acsm/

arXiv:1903.08225 [pdf, other]

Cross-task weakly supervised learning from instructional videos

Authors: Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic

Abstract: In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: 'pour egg' should be trained jointly with other tasks involving 'pour' and 'egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit systematic studying of sharing and so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that our component model can parse previously unseen tasks by virtue of its compositionality.

Submitted 29 April, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

Comments: 18 pages, 17 figures, to be published in proceedings of the CVPR, 2019

arXiv:1903.04538 [pdf, other]

doi 10.3847/1538-4365/ab1005

A Machine Learning Dataset Prepared From the NASA Solar Dynamics Observatory Mission

Authors: Richard Galvez, David F. Fouhey, Meng Jin, Alexandre Szenicer, Andrés Muñoz-Jaramillo, Mark C. M. Cheung, Paul J. Wright, Monica G. Bobra, Yang Liu, James Mason, Rajat Thomas

Abstract: In this paper we present a curated dataset from the NASA Solar Dynamics Observatory (SDO) mission in a format suitable for machine learning research. Beginning from level 1 scientific products we have processed various instrumental corrections, downsampled to manageable spatial and temporal resolutions, and synchronized observations spatially and temporally. We illustrate the use of this dataset with two example applications: forecasting future EVE irradiance from present EVE irradiance and translating HMI observations into AIA observations. For each application we provide metrics and baselines for future model comparison. We anticipate this curated dataset will facilitate machine learning research in heliophysics and the physical sciences generally, increasing the scientific return of the SDO mission. This work is a direct result of the 2018 NASA Frontier Development Laboratory Program. Please see the appendix for access to the dataset.

Submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted to The Astrophysical Journal Supplement Series; 11 pages, 8 figures

arXiv:1812.00940 [pdf, other]

Visual Memory for Robust Path Following

Authors: Ashish Kumar, Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik

Abstract: Humans routinely retrace paths in a novel environment both forwards and backwards despite uncertainty in their motion. This paper presents an approach for doing so. Given a demonstration of a path, a first network generates a path abstraction. Equipped with this abstraction, a second network observes the world and decides how to act to retrace the path under noisy actuation and a changing environment. The two networks are optimized end-to-end at training time. We evaluate the method in two realistic simulators, performing path following and homing under actuation noise and environmental changes. Our experiments show that our approach outperforms classical approaches and other learning based baselines.

Submitted 3 December, 2018; originally announced December 2018.

Comments: Neural Information Processing Systems (NeurIPS) 2018. Oral Presentation

arXiv:1712.08125 [pdf, other]

Unifying Map and Landmark Based Representations for Visual Navigation

Authors: Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik

Abstract: This work presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

Submitted 21 December, 2017; originally announced December 2017.

Comments: Project page with videos: https://s-gupta.github.io/cmpl/

arXiv:1712.02310 [pdf, other]

From Lifestyle Vlogs to Everyday Interactions

Authors: David F. Fouhey, Wei-cheng Kuo, Alexei A. Efros, Jitendra Malik

Abstract: A major stumbling block to progress in understanding basic human interactions, such as getting out of bed or opening a refrigerator, is lack of good training data. Most past efforts have gathered this data explicitly: starting with a laundry list of action labels, and then querying search engines for videos tagged with each label. In this work, we do the reverse and search implicitly: we start with a large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the source of surprisingly large and diverse interaction data. We show that by collecting the data first, we are able to achieve greater scale and far greater diversity in terms of actions and actors. Additionally, our data exposes biases built into common explicitly gathered data. We make sense of our data by analyzing the central component of interaction -- hands. We benchmark two tasks: identifying semantic object contact at the video level and non-semantic contact state at the frame level. We additionally demonstrate future prediction of hands.

Submitted 6 December, 2017; originally announced December 2017.

Comments: Project page at: http://people.eecs.berkeley.edu/~dfouhey/2017/VLOG/

arXiv:1712.01812 [pdf, other]

Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene

Authors: Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A. Efros, Jitendra Malik

Abstract: The goal of this paper is to take a single 2D image of a scene and recover the 3D structure in terms of a small set of factors: a layout representing the enclosing surfaces as well as a set of objects represented in terms of shape and pose. We propose a convolutional neural network-based approach to predict this representation and benchmark it on a large dataset of indoor scenes. Our experiments evaluate a number of practical design questions, demonstrate that we can infer this representation, and quantitatively and qualitatively demonstrate its merits compared to alternate representations.

Submitted 24 April, 2018; v1 submitted 5 December, 2017; originally announced December 2017.

Comments: Project url with code: https://shubhtuls.github.io/factored3d

arXiv:1612.06836 [pdf, other]

From Images to 3D Shape Attributes

Authors: David F. Fouhey, Abhinav Gupta, Andrew Zisserman

Abstract: Our goal in this paper is to investigate properties of 3D shape that can be determined from a single image. We define 3D shape attributes -- generic properties of the shape that capture curvature, contact and occupied space. Our first objective is to infer these 3D shape attributes from a single image. A second objective is to infer a 3D shape embedding -- a low dimensional vector representing the 3D shape. We study how the 3D shape attributes and embedding can be obtained from a single image by training a Convolutional Neural Network (CNN) for this task. We start with synthetic images so that the contribution of various cues and nuisance parameters can be controlled. We then turn to real images and introduce a large scale image dataset of sculptures containing 143K images covering 2197 works from 242 artists. For the CNN trained on the sculpture dataset we show the following: (i) which regions of the imaged sculpture are used by the CNN to infer the 3D shape attributes; (ii) that the shape embedding can be used to match previously unseen sculptures largely independent of viewpoint; and (iii) that the 3D attributes generalize to images of other (non-sculpture) object classes.

Submitted 3 December, 2017; v1 submitted 20 December, 2016; originally announced December 2016.

Comments: Updated based on TPAMI reviews: title changed, sections reordered, moderate modifications throughout text

arXiv:1603.08637 [pdf, other]

Learning a Predictable and Generative Vector Representation for Objects

Authors: Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, Abhinav Gupta

Abstract: What is a good vector representation of an object? We believe that it should be generative in 3D, in the sense that it can produce new 3D objects; as well as be predictable from 2D, in the sense that it can be perceived from 2D images. We propose a novel architecture, called the TL-embedding network, to learn an embedding space with these properties. The network consists of two components: (a) an autoencoder that ensures the representation is generative; and (b) a convolutional network that ensures the representation is predictable. This enables tackling a number of tasks including voxel prediction from 2D images and 3D model retrieval. Extensive experimental analysis demonstrates the usefulness and versatility of this embedding.

Submitted 31 August, 2016; v1 submitted 29 March, 2016; originally announced March 2016.

Comments: To appear in ECCV 2016. Project webpage: rohitgirdhar.github.io/GenerativePredictableVoxels/

arXiv:1505.01085 [pdf, other]

In Defense of the Direct Perception of Affordances

Authors: David F. Fouhey, Xiaolong Wang, Abhinav Gupta

Abstract: The field of functional recognition or affordance estimation from images has seen a revival in recent years. As originally proposed by Gibson, the affordances of a scene were directly perceived from the ambient light: in other words, functional properties like sittable were estimated directly from incoming pixels. Recent work, however, has taken a mediated approach in which affordances are derived by first estimating semantics or geometry and then reasoning about the affordances. In a tribute to Gibson, this paper explores his theory of affordances as originally proposed. We propose two approaches for direct perception of affordances and show that they obtain good results and can out-perform mediated approaches. We hope this paper can rekindle discussion around direct perception and its implications in the long term.

Submitted 5 May, 2015; originally announced May 2015.

arXiv:1411.4958 [pdf, other]

Designing Deep Networks for Surface Normal Estimation

Authors: Xiaolong Wang, David F. Fouhey, Abhinav Gupta

Abstract: In the past few years, convolutional neural nets (CNNs) have shown incredible promise for learning visual representations. In this paper, we use CNNs for the task of predicting surface normals from a single image. But what is the right architecture we should use? We propose to build upon the decades of hard work in 3D scene understanding to design a new CNN architecture for the task of surface normal estimation. We show that incorporating several constraints (man-made, Manhattan world) and meaningful intermediate representations (room layout, edge labels) in the architecture leads to state-of-the-art performance on surface normal estimation. We also show that our network is quite robust, achieving state-of-the-art results on other datasets as well without any fine-tuning.

Submitted 18 November, 2014; originally announced November 2014.

Showing 1–42 of 42 results for author: Fouhey, D