Showing 1–42 of 42 results for author: Hoiem, D

  1. arXiv:2404.08252

    cs.CV

    MonoPatchNeRF: Improving Neural Radiance Fields with Patch-based Monocular Guidance

    Authors: Yuqun Wu, Jae Yong Lee, Chuhang Zou, Shenlong Wang, Derek Hoiem

    Abstract: The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation for multiview stereo (MVS) benchmarks such as ETH3D. In this paper, we aim to create 3D models that provide accurate geometry and view synthesis, partially closing the large geometric performance gap between NeRF and traditional MVS methods. We propose a patch-based approach that effectively…

    Submitted 12 April, 2024; originally announced April 2024.

    Comments: 26 pages, 15 figures

  2. arXiv:2402.02352

    cs.CV

    Region-Based Representations Revisited

    Authors: Michal Shlapentokh-Rothman, Ansel Blume, Yao Xiao, Yuqun Wu, Sethuraman T V, Heyi Tao, Jae Yong Lee, Wilfredo Torres, Yu-Xiong Wang, Derek Hoiem

    Abstract: We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks, including semantic…

    Submitted 9 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: CVPR 2024 Camera Ready; website: https://regionreps.web.illinois.edu/

  3. arXiv:2312.17172

    cs.CV cs.AI cs.CL

    Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

    Authors: Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi

    Abstract: We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse moda…

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: 38 pages, 20 figures

  4. arXiv:2311.13258

    cs.CV cs.CL cs.LG

    ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

    Authors: Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji

    Abstract: State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visu…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Accepted to EMNLP 2023

  5. arXiv:2310.16042

    cs.CL cs.AI

    WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

    Authors: Heyi Tao, Sethuraman T V, Michal Shlapentokh-Rothman, Derek Hoiem

    Abstract: The paper investigates using a Large Language Model (LLM) to automatically perform web software tasks using click, scroll, and text input operations. Previous approaches, such as reinforcement learning (RL) or imitation learning, are inefficient to train and task-specific. Our method uses filtered Document Object Model (DOM) elements as observations and performs tasks step-by-step, sequentially ge…

    Submitted 24 October, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

  6. arXiv:2307.01430

    cs.CV

    Continual Learning in Open-vocabulary Classification with Complementary Memory Systems

    Authors: Zhen Zhu, Weijie Lyu, Yao Xiao, Derek Hoiem

    Abstract: We introduce a method for flexible and efficient continual learning in open-vocabulary image classification, drawing inspiration from the complementary learning systems observed in human cognition. Specifically, we propose to combine predictions from a CLIP zero-shot model and the exemplar-based model, using the zero-shot estimated probability that a sample's class is within the exemplar classes.…

    Submitted 3 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

    Comments: In review

  7. arXiv:2307.01425

    cs.CV

    Consistent Multimodal Generation via A Unified GAN Framework

    Authors: Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem

    Abstract: We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic, and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality…

    Submitted 3 July, 2023; originally announced July 2023.

    Comments: In review

  8. arXiv:2306.00987

    cs.CV cs.GR cs.LG

    StyleGAN knows Normal, Depth, Albedo, and More

    Authors: Anand Bhattad, Daniel McKee, Derek Hoiem, D. A. Forsyth

    Abstract: Intrinsic images, in the original sense, are image-like maps of scene properties like depth, normal, albedo or shading. This paper demonstrates that StyleGAN can easily be induced to produce intrinsic images. The procedure is straightforward. We show that, if StyleGAN produces $G({w})$ from latents ${w}$, then for each type of intrinsic image, there is a fixed offset ${d}_c$ so that…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Beyond Image Generation: StyleGAN knows Normals, Depth, Albedo, Shading, Segmentation and perhaps more!

  9. arXiv:2304.14403

    cs.CV cs.GR cs.LG

    Make It So: Steering StyleGAN for Any Image Inversion and Editing

    Authors: Anand Bhattad, Viraj Shah, Derek Hoiem, D. A. Forsyth

    Abstract: StyleGAN's disentangled style representation enables powerful image editing by manipulating the latent variables, but accurately mapping real-world images to their latent variables (GAN inversion) remains a challenge. Existing GAN inversion methods struggle to maintain editing directions and produce realistic results. To address these limitations, we propose Make It So, a novel GAN inversion met…

    Submitted 27 April, 2023; originally announced April 2023.

    Comments: project: https://anandbhattad.github.io/makeitso/

  10. arXiv:2212.00987

    cs.CV

    Sparse SPN: Depth Completion from Sparse Keypoints

    Authors: Yuqun Wu, Jae Yong Lee, Derek Hoiem

    Abstract: Our long term goal is to use image-based depth completion to quickly create 3D models from sparse point clouds, e.g. from SfM or SLAM. Much progress has been made in depth completion. However, most current works assume well distributed samples of known depth, e.g. Lidar or random uniform sampling, and perform poorly on uneven samples, such as from keypoints, due to the large unsampled regions. To…

    Submitted 2 December, 2022; originally announced December 2022.

  11. arXiv:2212.00914

    cs.CV

    QFF: Quantized Fourier Features for Neural Field Representations

    Authors: Jae Yong Lee, Yuqun Wu, Chuhang Zou, Shenlong Wang, Derek Hoiem

    Abstract: Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to improve speed of learning details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features that are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). As a naturally multiresolut…

    Submitted 1 December, 2022; originally announced December 2022.

  12. arXiv:2210.07582

    cs.CV

    Deep PatchMatch MVS with Learned Patch Coplanarity, Geometric Consistency and Adaptive Pixel Sampling

    Authors: Jae Yong Lee, Chuhang Zou, Derek Hoiem

    Abstract: Recent work in multi-view stereo (MVS) combines learnable photometric scores and regularization with PatchMatch-based optimization to achieve robust pixelwise estimates of depth, normals, and visibility. However, non-learning based methods still outperform for large scenes with sparse views, in part due to use of geometric consistency constraints and ability to optimize over many views at high res…

    Submitted 14 October, 2022; originally announced October 2022.

  13. arXiv:2205.10747

    cs.CV cs.AI

    Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

    Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

    Abstract: The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have be…

    Submitted 13 October, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

  14. arXiv:2204.13653

    cs.CV

    GRIT: General Robust Image Task Benchmark

    Authors: Tanmay Gupta, Ryan Marten, Aniruddha Kembhavi, Derek Hoiem

    Abstract: Computer vision models excel at making predictions when the test distribution closely resembles the training distribution. Such models have yet to match the ability of biological vision to learn from multiple sources and generalize to new data sources and tasks. To facilitate the development and evaluation of more general vision systems, we introduce the General Robust Image Task (GRIT) benchmark.…

    Submitted 2 May, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

  15. arXiv:2202.02317

    cs.CV cs.CL

    Webly Supervised Concept Expansion for General Purpose Vision Models

    Authors: Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi

    Abstract: General Purpose Vision (GPV) systems are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective a…

    Submitted 20 July, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: ECCV 2022

  16. arXiv:2108.08943

    cs.CV

    PatchMatch-RL: Deep MVS with Pixelwise Depth, Normal, and Visibility

    Authors: Jae Yong Lee, Joseph DeGol, Chuhang Zou, Derek Hoiem

    Abstract: Recent learning-based multi-view stereo (MVS) methods show excellent performance with dense cameras and small depth ranges. However, non-learning based approaches still outperform for scenes with large depth ranges and sparser wide-baseline views, in part due to their PatchMatch optimization over pixelwise estimates of depth, normals, and visibility. In this paper, we propose an end-to-end trainab…

    Submitted 19 August, 2021; originally announced August 2021.

    Comments: Accepted to ICCV 2021 for oral presentation

  17. arXiv:2104.00743

    cs.CV cs.AI cs.CL cs.LG

    Towards General Purpose Vision Systems

    Authors: Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem

    Abstract: Computer vision systems today are primarily N-purpose systems, designed and trained for a predefined set of tasks. Adapting such systems to new tasks is challenging and often requires non-trivial modifications to the network architecture (e.g. adding new output heads) or training process (e.g. adding new losses). To reduce the time and expertise required to develop new applications, we would like…

    Submitted 19 April, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: CVPR 2022 Oral; Project page: https://prior.allenai.org/projects/gpv

  18. arXiv:2010.11029

    cs.LG cs.CV stat.ML

    Learning Curves for Analysis of Deep Networks

    Authors: Derek Hoiem, Tanmay Gupta, Zhizhong Li, Michal M. Shlapentokh-Rothman

    Abstract: Learning curves model a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to evaluate design choices, such as pretraining, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract th…

    Submitted 5 April, 2021; v1 submitted 21 October, 2020; originally announced October 2020.

    Comments: Improved text and figure organization, additional experiments on optimization

  19. arXiv:2006.09920

    cs.CV cs.CL cs.LG stat.ML

    Contrastive Learning for Weakly Supervised Phrase Grounding

    Authors: Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem

    Abstract: Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words i…

    Submitted 5 August, 2020; v1 submitted 17 June, 2020; originally announced June 2020.

    Comments: ECCV 2020 (spotlight paper), Project page: http://tanmaygupta.info/info-ground

  20. arXiv:1912.11566

    cs.CV

    Boundary Cues for 3D Object Shape Recovery

    Authors: Kevin Karsch, Zicheng Liao, Jason Rock, Jonathan T. Barron, Derek Hoiem

    Abstract: Early work in computer vision considered a host of geometric cues for both shape reconstruction and recognition. However, since then, the vision community has focused heavily on shading cues for reconstruction, and moved towards data-driven approaches for recognition. In this paper, we reconsider these perhaps overlooked "boundary" cues (such as self occlusions and folds in a surface), as well as…

    Submitted 24 December, 2019; originally announced December 2019.

  21. arXiv:1912.11565

    cs.GR

    Rendering Synthetic Objects into Legacy Photographs

    Authors: Kevin Karsch, Varsha Hedau, David Forsyth, Derek Hoiem

    Abstract: We propose a method to realistically insert synthetic objects into existing photographs without requiring access to the scene or any additional scene measurements. With a single image and a small amount of annotation, our method creates a physical model of the scene that is suitable for realistically rendering synthetic objects with diffuse, specular, and even glowing materials while accounting fo…

    Submitted 24 December, 2019; originally announced December 2019.

  22. arXiv:1912.08795

    cs.LG cs.CV stat.ML

    Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion

    Authors: Hongxu Yin, Pavlo Molchanov, Zhizhong Li, Jose M. Alvarez, Arun Mallya, Derek Hoiem, Niraj K. Jha, Jan Kautz

    Abstract: We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network. We 'invert' a trained network (teacher) to synthesize class-conditional input images starting from random noise, without using any additional information about the training dataset. Keeping the teacher fixed, our method optimizes the input while regularizing the distrib…

    Submitted 15 June, 2020; v1 submitted 18 December, 2019; originally announced December 2019.

  23. arXiv:1910.04099

    cs.CV cs.LG eess.IV

    Manhattan Room Layout Reconstruction from a Single 360 image: A Comparative Study of State-of-the-art Methods

    Authors: Chuhang Zou, Jheng-Wei Su, Chi-Han Peng, Alex Colburn, Qi Shan, Peter Wonka, Hung-Kuo Chu, Derek Hoiem

    Abstract: Recent approaches for predicting layouts from 360 panoramas produce excellent results. These approaches build on a common framework consisting of three steps: a pre-processing step based on edge-based alignment, prediction of layout elements, and a post-processing step by fitting a 3D layout to the layout elements. Until now, it has been difficult to compare the methods due to multiple different d…

    Submitted 25 December, 2020; v1 submitted 9 October, 2019; originally announced October 2019.

    Comments: Accepted by International Journal of Computer Vision (IJCV), 2021

  24. arXiv:1908.08527

    cs.CV cs.CL

    ViCo: Word Embeddings from Visual Co-occurrences

    Authors: Tanmay Gupta, Alexander Schwing, Derek Hoiem

    Abstract: We propose to learn word embeddings from visual co-occurrences. Two words co-occur visually if both words apply to the same image or image region. Specifically, we extract four types of visual co-occurrences between object and attribute words from large-scale, textually-annotated visual databases like VisualGenome and ImageNet. We then train a multi-task log-bilinear model that compactly encodes w…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Accepted to ICCV 2019. Project Page: http://tanmaygupta.info/vico/

  25. arXiv:1908.06079

    cs.CV stat.ML

    Task-Assisted Domain Adaptation with Anchor Tasks

    Authors: Zhizhong Li, Linjie Luo, Sergey Tulyakov, Qieyun Dai, Derek Hoiem

    Abstract: Some tasks, such as surface normals or single-view depth estimation, require per-pixel ground truth that is difficult to obtain on real images but easy to obtain on synthetic. However, models learned on synthetic images often do not generalize well to real images due to the domain shift. Our key idea to improve domain adaptation is to introduce a separate anchor task (such as facial landmarks) who…

    Submitted 9 November, 2020; v1 submitted 16 August, 2019; originally announced August 2019.

    Comments: In WACV 2021

  26. arXiv:1907.12253

    cs.CV cs.LG eess.IV

    Silhouette Guided Point Cloud Reconstruction beyond Occlusion

    Authors: Chuhang Zou, Derek Hoiem

    Abstract: One major challenge in 3D reconstruction is to infer the complete shape geometry from partial foreground occlusions. In this paper, we propose a method to reconstruct the complete 3D shape of an object from a single RGB image, with robustness to occlusion. Given the image and a silhouette of the visible region, our approach completes the silhouette of the occluded region and then generates a point…

    Submitted 29 July, 2019; originally announced July 2019.

  27. arXiv:1811.05967

    cs.CV

    No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques

    Authors: Tanmay Gupta, Alexander Schwing, Derek Hoiem

    Abstract: We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches. Our model includes factors for detection scores, human and object appearance, and coarse (box-pair configuration) and optionally fine-grained layout (human pose). We also develop traini…

    Submitted 22 August, 2019; v1 submitted 14 November, 2018; originally announced November 2018.

    Comments: Accepted to ICCV 2019. Project Page: http://tanmaygupta.info/no_frills/

  28. arXiv:1804.06032

    cs.CV

    Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction

    Authors: Daeyun Shin, Charless C. Fowlkes, Derek Hoiem

    Abstract: The goal of this paper is to compare surface-based and volumetric 3D object shape representations, as well as viewer-centered and object-centered reference frames for single-view 3D shape prediction. We propose a new algorithm for predicting depth maps from multiple viewpoints, with a single depth or RGB image as input. By modifying the network and the way models are evaluated, we can directly com…

    Submitted 11 June, 2018; v1 submitted 16 April, 2018; originally announced April 2018.

    Comments: CVPR 2018

  29. arXiv:1804.03608

    cs.CV cs.CL cs.IR cs.LG

    Imagine This! Scripts to Compositions to Videos

    Authors: Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, Aniruddha Kembhavi

    Abstract: Imagining a scene described in natural language with realistic layout and appearance of entities is the ultimate test of spatial, visual, and semantic world knowledge. Towards this goal, we present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning this knowledge from video-caption data and applying it while generating videos from novel captions. CRAFT explicitly p…

    Submitted 10 April, 2018; originally announced April 2018.

    Comments: Supplementary material included

  30. arXiv:1804.03166

    cs.CV cs.LG

    Improving Confidence Estimates for Unfamiliar Examples

    Authors: Zhizhong Li, Derek Hoiem

    Abstract: Intuitively, unfamiliarity should lead to lack of confidence. In reality, current algorithms often make highly confident yet wrong predictions when faced with relevant but unfamiliar examples. A classifier we trained to recognize gender is 12 times more likely to be wrong with a 99% confident prediction if presented with a subject from a different age group than those seen during training. In this…

    Submitted 7 September, 2020; v1 submitted 9 April, 2018; originally announced April 2018.

    Comments: Published in CVPR 2020 (oral). ERRATA: (1) a previous version (v3) included erroneous results for $T$-scaling, where novel samples are mistakenly included in the validation set for calibration. Please disregard those results. (2) Previous versions (v4, v5) incorrectly stated that Adam was used. In fact, we used SGD

  31. arXiv:1803.08999

    cs.CV cs.AI

    LayoutNet: Reconstructing the 3D Room Layout from a Single RGB Image

    Authors: Chuhang Zou, Alex Colburn, Qi Shan, Derek Hoiem

    Abstract: We propose an algorithm to predict room layout from a single image that generalizes across panoramas and perspective images, cuboid layouts and more general layouts (e.g. L-shape room). Our method operates directly on the panoramic image, rather than decomposing into perspective images as do recent works. Our network architecture is similar to that of RoomNet, but we show improvements due to align…

    Submitted 23 March, 2018; originally announced March 2018.

    Comments: CVPR2018

  32. arXiv:1710.09490

    cs.CV

    Complete 3D Scene Parsing from an RGBD Image

    Authors: Chuhang Zou, Ruiqi Guo, Zhizhong Li, Derek Hoiem

    Abstract: One major goal of vision is to infer physical models of objects, surfaces, and their layout from sensors. In this paper, we aim to interpret indoor scenes from one RGBD image. Our representation encodes the layout of orthogonal walls and the extent of objects, modeled with CAD-like 3D shapes. We parse both the visible and occluded portions of the scene and all observable objects, producing a compl…

    Submitted 13 November, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

    Comments: Accepted to International Journal of Computer Vision (IJCV), 2018. arXiv admin note: text overlap with arXiv:1504.02437

  33. arXiv:1708.02982

    cs.CV

    ChromaTag: A Colored Marker and Fast Detection Algorithm

    Authors: Joseph DeGol, Timothy Bretl, Derek Hoiem

    Abstract: Current fiducial marker detection algorithms rely on marker IDs for false positive rejection. Time is wasted on potential detections that will eventually be rejected as false positives. We introduce ChromaTag, a fiducial marker and detection algorithm designed to use opponent colors to limit and quickly reject initial false detections and grayscale for precise localization. Through experiments, we…

    Submitted 9 August, 2017; originally announced August 2017.

    Comments: International Conference on Computer Vision (ICCV '17)

  34. arXiv:1708.01648

    cs.CV cs.AI cs.LG stat.ML

    3D-PRNN: Generating Shape Primitives with Recurrent Neural Networks

    Authors: Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, Derek Hoiem

    Abstract: The success of various applications including robotics, digital content creation, and visualization demand a structured and abstract representation of the 3D world from limited sensor data. Inspired by the nature of human perception of 3D shapes as a collection of simple parts, we explore such an abstract shape representation based on primitives. Given a single depth image of an object, we present…

    Submitted 4 August, 2017; originally announced August 2017.

    Comments: ICCV 2017

  35. arXiv:1704.00260

    cs.CV cs.AI cs.LG cs.NE stat.ML

    Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

    Authors: Tanmay Gupta, Kevin Shih, Saurabh Singh, Derek Hoiem

    Abstract: An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answe…

    Submitted 16 October, 2017; v1 submitted 2 April, 2017; originally announced April 2017.

    Comments: Accepted in ICCV 2017. The arxiv version has an extra analysis on correlation with human attention

  36. Geometry-Informed Material Recognition

    Authors: Joseph DeGol, Mani Golparvar-Fard, Derek Hoiem

    Abstract: Our goal is to recognize material categories using images and geometry information. In many applications, such as construction management, coarse geometry information is available. We investigate how 3D geometry (surface normals, camera intrinsic and extrinsic parameters) can be used with 2D features (texture and color) to improve material classification. We introduce a new dataset, GeoMat, which…

    Submitted 18 July, 2016; originally announced July 2016.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition 2016 (CVPR '16)

  37. arXiv:1606.09282

    cs.CV cs.LG stat.ML

    Learning without Forgetting

    Authors: Zhizhong Li, Derek Hoiem

    Abstract: When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabili…

    Submitted 14 February, 2017; v1 submitted 29 June, 2016; originally announced June 2016.

    Comments: Conference version appears in ECCV 2016; updated with journal version

  38. arXiv:1606.05002

    cs.CV

    3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera

    Authors: Tanmay Gupta, Daeyun Shin, Naren Sivagnanadasan, Derek Hoiem

    Abstract: We propose an approach for 3D reconstruction and segmentation of a single object placed on a flat surface from an input video. Our approach is to perform dense depth map estimation for multiple views using a proposed objective function that preserves detail. The resulting depth maps are then fused using a proposed implicit surface function that is robust to estimation error, producing a smooth sur…

    Submitted 27 July, 2016; v1 submitted 15 June, 2016; originally announced June 2016.

  39. arXiv:1605.06465

    cs.CV cs.LG cs.NE

    Swapout: Learning an ensemble of deep architectures

    Authors: Saurabh Singh, Derek Hoiem, David Forsyth

    Abstract: We describe Swapout, a new stochastic training method, that outperforms ResNets of identical network structure yielding impressive results on CIFAR-10 and CIFAR-100. Swapout samples from a rich set of architectures including dropout, stochastic depth and residual architectures as special cases. When viewed as a regularization method swapout not only inhibits co-adaptation of units in a layer, simi…

    Submitted 20 May, 2016; originally announced May 2016.

    Comments: Submitted to NIPS 2016

  40. arXiv:1511.07394

    cs.CV

    Where To Look: Focus Regions for Visual Question Answering

    Authors: Kevin J. Shih, Saurabh Singh, Derek Hoiem

    Abstract: We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the VQA dataset which is the largest…

    Submitted 10 January, 2016; v1 submitted 23 November, 2015; originally announced November 2015.

    Comments: Submitted to CVPR2016

  41. arXiv:1507.06332

    cs.CV

    Part Localization using Multi-Proposal Consensus for Fine-Grained Categorization

    Authors: Kevin J. Shih, Arun Mallya, Saurabh Singh, Derek Hoiem

    Abstract: We present a simple deep learning framework to simultaneously predict keypoint locations and their respective visibilities and use those to achieve state-of-the-art performance for fine-grained classification. We show that by conditioning the predictions on object proposals with sufficient image support, our method can do well without complicated spatial reasoning. Instead, inference methods with…

    Submitted 22 July, 2015; originally announced July 2015.

    Comments: BMVC 2015

  42. arXiv:1504.02437

    cs.CV

    Predicting Complete 3D Models of Indoor Scenes

    Authors: Ruiqi Guo, Chuhang Zou, Derek Hoiem

    Abstract: One major goal of vision is to infer physical models of objects, surfaces, and their layout from sensors. In this paper, we aim to interpret indoor scenes from one RGBD image. Our representation encodes the layout of walls, which must conform to a Manhattan structure but is otherwise flexible, and the layout and extent of objects, modeled with CAD-like 3D shapes. We represent both the visible and…

    Submitted 17 August, 2017; v1 submitted 9 April, 2015; originally announced April 2015.