Search | arXiv e-print repository

Multistable Shape from Shading Emerges from Patch Diffusion

Authors: Xinran Nicole Han, Todd Zickler, Ko Nishino

Abstract: Models for monocular shape reconstruction of surfaces with diffuse reflection -- shape from shading -- ought to produce distributions of outputs, because there are fundamental mathematical ambiguities of both continuous (e.g., bas-relief) and discrete (e.g., convex/concave) varieties which are also experienced by humans. Yet, the outputs of current models are limited to point estimates or tight distributions around single modes, which prevents them from capturing these effects. We introduce a model that reconstructs a multimodal distribution of shapes from a single shading image, which aligns with the human experience of multistable perception. We train a small denoising diffusion process to generate surface normal fields from $16\times 16$ patches of synthetic images of everyday 3D objects. We deploy this model patch-wise at multiple scales, with guidance from inter-patch shape consistency constraints. Despite its relatively small parameter count and predominantly bottom-up structure, we show that multistable shape explanations emerge from this model for "ambiguous" test images that humans experience as being multistable. At the same time, the model produces veridical shape estimates for object-like images that include distinctive occluding contours and appear less ambiguous. This may inspire new architectures for stochastic 3D shape perception that are more efficient and better aligned with human experience.

Submitted 23 May, 2024; originally announced May 2024.
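
The discrete convex/concave ambiguity mentioned in the abstract above can be checked numerically. The following is a minimal sketch, assuming a Lambertian surface viewed along the z axis and a single distant light; the helper names (`lambertian`, `normalize`, `flip_xy`) and the sample normals are illustrative assumptions, not anything taken from the paper. Inverting the depth of a surface negates the x and y components of its normals, and applying the same flip to the light direction leaves every shading value unchanged.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def lambertian(normal, light):
    """Lambertian shading for unit vectors: max(0, n . l)."""
    return max(0.0, sum(n * l for n, l in zip(normal, light)))

def flip_xy(v):
    """Negate the image-plane (x, y) components, keep the viewing-axis (z) one."""
    return (-v[0], -v[1], v[2])

# Hypothetical normals of a small "convex" patch and a hypothetical light.
normals = [normalize(v) for v in [(0.3, 0.1, 1.0), (-0.5, 0.2, 1.0), (0.0, -0.7, 1.0)]]
light = normalize((0.4, 0.3, 1.0))

# Shading of the original patch, and of the depth-inverted patch lit
# from the mirrored direction.
convex_image = [lambertian(n, light) for n in normals]
concave_image = [lambertian(flip_xy(n), flip_xy(light)) for n in normals]

print(convex_image)
print(concave_image)
```

Both lists hold the same values, so a single shading image cannot by itself tell the two shape explanations apart.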

arXiv:2312.04553 [pdf, other]

SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing

Authors: Tomoki Ichikawa, Shohei Nobuhara, Ko Nishino

Abstract: Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision, xR, robotics, and HCI. We introduce structured polarization for invisible depth and reflectance sensing (SPIDeRS), the first depth and reflectance sensing method using patterns of polarized light. The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing.

Submitted 31 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: to be published in CVPR 2024
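
The angle of linear polarization (AoLP) that the abstract above refers to can be computed per pixel from linear Stokes parameters. The sketch below assumes a polarimetric camera that measures intensity behind linear polarizers at 0, 45, 90, and 135 degrees, which is a common sensor layout but an assumption here; the function names are illustrative, and this is the textbook computation rather than the pattern-extraction method derived in the paper.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical
    s2 = i45 - i135                     # diagonal vs. anti-diagonal
    return s0, s1, s2

def aolp_dolp(i0, i45, i90, i135):
    """Angle (radians, in [0, pi)) and degree of linear polarization."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    aolp = (0.5 * math.atan2(s2, s1)) % math.pi
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    return aolp, dolp

def simulate_pixel(intensity, dolp, aolp):
    """Intensities seen through ideal polarizers at 0/45/90/135 degrees."""
    return tuple(
        0.5 * intensity * (1.0 + dolp * math.cos(2.0 * (math.radians(a) - aolp)))
        for a in (0, 45, 90, 135)
    )

# Round trip on one synthetic pixel: light polarized at 60 degrees, 40% polarized.
measured = simulate_pixel(intensity=1.0, dolp=0.4, aolp=math.radians(60))
recovered_aolp, recovered_dolp = aolp_dolp(*measured)
print(math.degrees(recovered_aolp), recovered_dolp)
```

Because AoLP is defined modulo 180 degrees, the result is wrapped into the range [0, pi).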

arXiv:2312.04530 [pdf, other]

Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation

Authors: Genki Kinoshita, Ko Nishino

Abstract: In this paper, we introduce a novel training method for making any monocular depth network learn absolute scale and estimate metric road-scene depth just from regular training data, i.e., driving videos. We refer to this training framework as StableCamH. The key idea is to leverage cars found on the road as sources of scale supervision but to incorporate them in the training robustly. StableCamH detects and estimates the sizes of cars in the frame and aggregates scale information extracted from them into a camera height estimate whose consistency across the entire video sequence is enforced as scale supervision. This realizes robust unsupervised training of any, otherwise scale-oblivious, monocular depth network to become not only scale-aware but also metric-accurate without the need for auxiliary sensors and extra supervision. Extensive experiments on the KITTI and Cityscapes datasets show the effectiveness of StableCamH and its state-of-the-art accuracy compared with related methods. We also show that StableCamH enables training on mixed datasets of different camera heights, which leads to larger-scale training and thus higher generalization. Metric depth reconstruction is essential in any road-scene visual modeling, and StableCamH democratizes its deployment by establishing the means to train any model as a metric depth estimator.

Submitted 20 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.
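
The idea of turning per-car scale cues into one camera height that must stay constant over a video can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the median aggregation, the absolute-deviation penalty, the function names, and all numbers are hypothetical choices made for illustration and are not taken from the StableCamH paper.

```python
from statistics import median

def frame_camera_height(unscaled_camera_height, car_observations):
    """Metric camera height for one frame, or None if no car is usable.

    car_observations holds (prior_metric_size, estimated_unscaled_size)
    pairs; each ratio is one estimate of the metric scale factor.
    """
    scales = [prior / est for prior, est in car_observations if est > 0]
    if not scales:
        return None
    # The median keeps a single badly estimated car from dominating (assumption).
    return median(scales) * unscaled_camera_height

def height_consistency_loss(frame_heights):
    """Mean absolute deviation from the sequence-level median height."""
    valid = [h for h in frame_heights if h is not None]
    if not valid:
        return 0.0
    target = median(valid)
    return sum(abs(h - target) for h in valid) / len(valid)

# Hypothetical sequence: (camera height in network units, car observations).
frames = [
    (0.50, [(1.5, 0.5), (1.6, 0.5)]),
    (0.55, [(1.5, 0.5)]),
    (0.52, []),  # no cars detected in this frame
]
heights = [frame_camera_height(h, cars) for h, cars in frames]
loss = height_consistency_loss(heights)
print(heights, loss)
```

Frames without detected cars contribute no height estimate and are skipped by the penalty in this sketch.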

arXiv:2312.04529 [pdf, other]

Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance

Authors: Yuto Enyo, Ko Nishino

Abstract: Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper, we introduce the first stochastic inverse rendering method, which recovers the attenuated frequency spectrum of an illumination jointly with the reflectance of an object of known geometry from a single image. Our key idea is to solve this blind inverse problem in the reflectance map, an appearance representation invariant to the underlying geometry, by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image, DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks, IllNet and RefNet, which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images, showing state-of-the-art accuracy on established datasets.

Submitted 26 March, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: to be published in CVPR 2024

arXiv:2312.04527 [pdf, other]

Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

Authors: Kohei Yamashita, Vincent Lepetit, Ko Nishino

Abstract: Computer vision has long relied on two kinds of correspondences: pixel correspondences in images and 3D correspondences on object surfaces. Is there another kind, and if there is, what can they do for us? In this paper, we introduce correspondences of the third kind we call reflection correspondences and show that they can help estimate camera pose by just looking at objects without relying on the background. Reflection correspondences are point correspondences in the reflected world, i.e., the scene reflected by the object surface. The object geometry and reflectance alter the scene geometrically and radiometrically, respectively, causing incorrect pixel correspondences. Geometry recovered from each image is also hampered by distortions, namely generalized bas-relief ambiguity, leading to erroneous 3D correspondences. We show that reflection correspondences can resolve the ambiguities arising from these distortions. We introduce a neural correspondence estimator and a RANSAC algorithm that fully leverages all three kinds of correspondences for robust and accurate joint camera pose and object shape estimation just from the object appearance. The method expands the horizon of numerous downstream tasks, including camera pose estimation for appearance modeling (e.g., NeRF) and motion estimation of reflective objects (e.g., cars on the road), to name a few, as it relieves the requirement of overlapping background.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2310.17632 [pdf, other]

DeepShaRM: Multi-View Shape and Reflectance Map Recovery Under Unknown Lighting

Authors: Kohei Yamashita, Shohei Nobuhara, Ko Nishino

Abstract: Geometry reconstruction of textureless, non-Lambertian objects under unknown natural illumination (i.e., in the wild) remains challenging as correspondences cannot be established and the reflectance cannot be expressed in simple analytical forms. We derive a novel multi-view method, DeepShaRM, that achieves state-of-the-art accuracy on this challenging task. Unlike past methods that formulate this as inverse-rendering, i.e., estimation of reflectance, illumination, and geometry from images, our key idea is to realize that reflectance and illumination need not be disentangled and can instead be estimated as a compound reflectance map. We introduce a novel deep reflectance map estimation network that recovers the camera-view reflectance maps from the surface normals of the current geometry estimate and the input multi-view images. The network also explicitly estimates per-pixel confidence scores to handle global light transport effects. A deep shape-from-shading network then updates the geometry estimate expressed with a signed distance function using the recovered reflectance maps. By alternating between these two, and, most important, by bypassing the ill-posed problem of reflectance and illumination decomposition, the method accurately recovers object geometry in these challenging settings. Extensive experiments on both synthetic and real-world data clearly demonstrate its state-of-the-art accuracy.

Submitted 26 October, 2023; originally announced October 2023.

Comments: 3DV 2024

arXiv:2304.06977 [pdf, other]

DeePoint: Visual Pointing Recognition and Direction Estimation

Authors: Shu Nakamura, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino

Abstract: In this paper, we realize automatic visual recognition and direction estimation of pointing. We introduce the first neural pointing understanding method based on two key contributions. The first is the introduction of a first-of-its-kind large-scale dataset for pointing recognition and direction estimation, which we refer to as the DP Dataset. DP Dataset consists of more than 2 million frames of 33 people pointing in various styles annotated for each frame with pointing timings and 3D directions. The second is DeePoint, a novel deep network model for joint recognition and 3D direction estimation of pointing. DeePoint is a Transformer-based network which fully leverages the spatio-temporal coordination of the body parts, not just the hands. Through extensive experiments, we demonstrate the accuracy and efficiency of DeePoint. We believe DP Dataset and DeePoint will serve as a sound foundation for visual human intention understanding.

Submitted 11 September, 2023; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: to be published in ICCV 2023

arXiv:2303.17890 [pdf, other]

Fooling Polarization-based Vision using Locally Controllable Polarizing Projection

Authors: Zhuoxiao Li, Zhihang Zhong, Shohei Nobuhara, Ko Nishino, Yinqiang Zheng

Abstract: Polarization is a fundamental property of light that encodes abundant information regarding surface shape, material, illumination and viewing geometry. The computer vision community has witnessed a blossoming of polarization-based vision applications, such as reflection removal, shape-from-polarization, transparent object segmentation and color constancy, partially due to the emergence of single-chip mono/color polarization sensors that make polarization data acquisition easier than ever. However, is polarization-based vision vulnerable to adversarial attacks? If so, is it possible to realize these adversarial attacks in the physical world, without being perceived by human eyes? In this paper, we warn the community of the vulnerability of polarization-based vision, which can be more serious than RGB-based vision. By adapting a commercial LCD projector, we achieve locally controllable polarizing projection, which is successfully utilized to fool state-of-the-art polarization-based vision algorithms for glass segmentation and color constancy. Compared with existing physical attacks on RGB-based vision, which always suffer from the trade-off between attack efficacy and eye conceivability, the adversarial attackers based on polarizing projection are contact-free and visually imperceptible, since naked human eyes can rarely perceive the difference between viciously manipulated polarizing light and ordinary illumination. This poses unprecedented risks to polarization-based vision, in both the monochromatic and trichromatic domains, for which due attention should be paid and countermeasures considered.

Submitted 19 June, 2024; v1 submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.13477 [pdf, other]

TransPoser: Transformer as an Optimizer for Joint Object Shape and Pose Estimation

Authors: Yuta Yoshitake, Mai Nishimura, Shohei Nobuhara, Ko Nishino

Abstract: We propose a novel method for joint estimation of shape and pose of rigid objects from their sequentially observed RGB-D images. In sharp contrast to past approaches that rely on complex non-linear optimization, we propose to formulate it as a neural optimization that learns to efficiently estimate the shape and pose. We introduce Deep Directional Distance Function (DeepDDF), a neural network that directly outputs the depth image of an object given the camera viewpoint and viewing direction, for efficient error computation in 2D image space. We formulate the joint estimation itself as a Transformer which we refer to as TransPoser. We fully leverage the tokenization and multi-head attention to sequentially process the growing set of observations and to efficiently update the shape and pose with a learned momentum, respectively. Experimental results on synthetic and real data show that DeepDDF achieves high accuracy as a category-level object shape representation and TransPoser achieves state-of-the-art accuracy efficiently for joint shape and pose estimation.

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2303.09534 [pdf, other]

InCrowdFormer: On-Ground Pedestrian World Model From Egocentric Views

Authors: Mai Nishimura, Shohei Nobuhara, Ko Nishino

Abstract: We introduce an on-ground Pedestrian World Model, a computational model that can predict how pedestrians move around an observer in the crowd on the ground plane, but from just the egocentric views of the observer. Our model, InCrowdFormer, fully leverages the Transformer architecture by modeling pedestrian interaction and egocentric-to-top-down view transformation with attention, and autoregressively predicts on-ground positions of a variable number of people with an encoder-decoder architecture. We encode the uncertainties arising from unknown pedestrian heights with latent codes to predict the posterior distributions of pedestrian positions. We validate the effectiveness of InCrowdFormer on a novel prediction benchmark of real movements. The results show that InCrowdFormer accurately predicts the future coordination of pedestrians. To the best of our knowledge, InCrowdFormer is the first-of-its-kind pedestrian world model which we believe will benefit a wide range of egocentric-view applications including crowd navigation, tracking, and synthesis.

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2212.04483 [pdf, other]

Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection

Authors: Tomoki Ichikawa, Yoshiki Fukao, Shohei Nobuhara, Ko Nishino

Abstract: Computer vision applications have heavily relied on the linear combination of Lambertian diffuse and microfacet specular reflection models for representing reflected radiance, which turns out to be physically incompatible and limited in applicability. In this paper, we derive a novel analytical reflectance model, which we refer to as the Fresnel Microfacet BRDF (FMBRDF) model, that is physically accurate and generalizes to various real-world surfaces. Our key idea is to model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, both for body and surface reflections. We carefully derive the Fresnel reflection and transmission for each microfacet as well as the light transport between them in the subsurface. This physically-grounded modeling also allows us to express the polarimetric behavior of reflected light in addition to its radiometric behavior. That is, FMBRDF unifies not only body and surface reflections but also light reflection in radiometry and polarization and represents them in a single model. Experimental results demonstrate its effectiveness in accuracy, expressive power, and image-based estimation.

Submitted 8 December, 2022; originally announced December 2022.
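
The per-facet Fresnel reflection mentioned in the abstract above builds on the standard Fresnel equations for a smooth interface between two dielectrics. The sketch below evaluates those textbook equations only; it is not the FMBRDF model, and the function name and the refractive index of 1.5 are assumptions chosen for illustration. It also shows why reflection is tied to polarization: the s- and p-polarized reflectances differ, and the p component vanishes at Brewster's angle.

```python
import math

def fresnel_dielectric(cos_i, n1, n2):
    """Power reflectances (Rs, Rp) at a smooth dielectric interface.

    cos_i is the cosine of the incidence angle; n1 and n2 are the
    refractive indices on the incident and transmitted sides.
    """
    sin_t_sq = (n1 / n2) ** 2 * (1.0 - cos_i ** 2)  # Snell's law
    if sin_t_sq > 1.0:
        return 1.0, 1.0  # total internal reflection
    cos_t = math.sqrt(1.0 - sin_t_sq)
    rs = (n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)
    rp = (n2 * cos_i - n1 * cos_t) / (n2 * cos_i + n1 * cos_t)
    return rs * rs, rp * rp

n_air, n_glass = 1.0, 1.5  # assumed indices for the example

# Normal incidence: both polarizations reflect ((n1 - n2) / (n1 + n2))^2.
rs_normal, rp_normal = fresnel_dielectric(1.0, n_air, n_glass)

# Brewster's angle, tan(theta_B) = n2 / n1: the p-polarized reflection vanishes.
theta_b = math.atan2(n_glass, n_air)
rs_brewster, rp_brewster = fresnel_dielectric(math.cos(theta_b), n_air, n_glass)

print(rs_normal, rp_normal)
print(rs_brewster, rp_brewster)
```

Averaging the two returned values gives the reflectance for unpolarized incident light.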

arXiv:2210.06332 [pdf, other]

ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view

Authors: Mai Nishimura, Shohei Nobuhara, Ko Nishino

Abstract: We introduce a novel learning-based method for view birdification, the task of recovering ground-plane trajectories of pedestrians of a crowd and their observer in the same crowd just from the observed ego-centric video. View birdification becomes essential for mobile robot navigation and localization in dense crowds where the static background is hard to see and reliably track. It is challenging mainly for two reasons: i) absolute trajectories of pedestrians are entangled with the movement of the observer which needs to be decoupled from their observed relative movements in the ego-centric video, and ii) a crowd motion model describing the pedestrian movement interactions is specific to the scene yet unknown a priori. For this, we introduce a Transformer-based network referred to as ViewBirdiformer which implicitly models the crowd motion through self-attention and decomposes relative 2D movement observations onto the ground-plane trajectories of the crowd and the camera through cross-attention between views. Most important, ViewBirdiformer achieves view birdification in a single forward pass which opens the door to accurate real-time, always-on situational awareness. Extensive experimental results demonstrate that ViewBirdiformer achieves accuracy similar to or better than state-of-the-art with three orders of magnitude reduction in execution time.

Submitted 12 October, 2022; originally announced October 2022.

arXiv:2207.11876 [pdf, other]

nLMVS-Net: Deep Non-Lambertian Multi-View Stereo

Authors: Kohei Yamashita, Yuto Enyo, Shohei Nobuhara, Ko Nishino

Abstract: We introduce a novel multi-view stereo (MVS) method that can simultaneously recover not just per-pixel depth but also surface normals, together with the reflectance of textureless, complex non-Lambertian surfaces captured under known but natural illumination. Our key idea is to formulate MVS as an end-to-end learnable network, which we refer to as nLMVS-Net, that seamlessly integrates radiometric cues to leverage surface normals as view-independent surface features for learned cost volume construction and filtering. It first estimates surface normals as pixel-wise probability densities for each view with a novel shape-from-shading network. These per-pixel surface normal densities and the input multi-view images are then input to a novel cost volume filtering network that learns to recover per-pixel depth and surface normal. The reflectance is also explicitly estimated by alternating with geometry reconstruction. Extensive quantitative evaluations on newly established synthetic and real-world datasets show that nLMVS-Net can robustly and accurately recover the shape and reflectance of complex objects in natural settings.

Submitted 10 November, 2022; v1 submitted 24 July, 2022; originally announced July 2022.

Comments: Accepted to WACV 2023

arXiv:2207.03870 [pdf, other]

BlindSpotNet: Seeing Where We Cannot See

Authors: Taichi Fukuda, Kotaro Hasegawa, Shinya Ishizaki, Shohei Nobuhara, Ko Nishino

Abstract: We introduce 2D blind spot estimation as a critical visual task for road scene understanding. By automatically detecting road regions that are occluded from the vehicle's vantage point, we can proactively alert a manual driver or a self-driving system to potential causes of accidents (e.g., draw attention to a road region from which a child may spring out). Detecting blind spots in full 3D would be challenging, as 3D reasoning on the fly even if the car is equipped with LiDAR would be prohibitively expensive and error-prone. We instead propose to learn to estimate blind spots in 2D, just from a monocular camera. We achieve this in two steps. We first introduce an automatic method for generating "ground-truth" blind spot training data for arbitrary driving videos by leveraging monocular depth estimation, semantic segmentation, and SLAM. The key idea is to reason in 3D but from 2D images by defining blind spots as those road regions that are currently invisible but become visible in the near future. We construct a large-scale dataset with this automatic offline blind spot estimation, which we refer to as Road Blind Spot (RBS) dataset. Next, we introduce BlindSpotNet (BSN), a simple network that fully leverages this dataset for fully automatic estimation of frame-wise blind spot probability maps for arbitrary driving videos. Extensive experimental results demonstrate the validity of our RBS Dataset and the effectiveness of our BSN.

Submitted 8 July, 2022; originally announced July 2022.

arXiv:2111.05060 [pdf, other]

View Birdification in the Crowd: Ground-Plane Localization from Perceived Movements

Authors: Mai Nishimura, Shohei Nobuhara, Ko Nishino

Abstract: We introduce view birdification, the problem of recovering ground-plane movements of people in a crowd from an ego-centric video captured from an observer (e.g., a person or a vehicle) also moving in the crowd. Recovered ground-plane movements would provide a sound basis for situational understanding and benefit downstream applications in computer vision and robotics. In this paper, we formulate view birdification as a geometric trajectory reconstruction problem and derive a cascaded optimization method from a Bayesian perspective. The method first estimates the observer's movement and then localizes surrounding pedestrians for each frame while taking into account the local interactions between them. We introduce three datasets by leveraging synthetic and real trajectories of people in crowds and evaluate the effectiveness of our method. The results demonstrate the accuracy of our method and set the ground for further studies of view birdification as an important but challenging visual understanding problem.

Submitted 25 October, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

Comments: Extended journal version of the original paper at BMVC 2021

arXiv:2009.11072 [pdf, other]

Differential Viewpoints for Ground Terrain Material Recognition

Authors: Jia Xue, Hang Zhang, Ko Nishino, Kristin J. Dana

Abstract: Computational surface modeling that underlies material recognition has transitioned from reflectance modeling using in-lab controlled radiometric measurements to image-based representations based on internet-mined single-view images captured in the scene. We take a middle-ground approach for material recognition that takes advantage of both rich radiometric cues and flexible image capture. A key concept is differential angular imaging, where small angular variations in image capture enable angular-gradient features for an enhanced appearance representation that improves recognition. We build a large-scale material database, Ground Terrain in Outdoor Scenes (GTOS) database, to support ground terrain recognition for applications such as autonomous driving and robot navigation. The database consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions. We develop a novel approach for material recognition called texture-encoded angular network (TEAN) that combines deep encoding pooling of RGB information and differential angular images for angular-gradient features to fully leverage this large dataset. With this novel network architecture, we extract characteristics of materials encoded in the angular and spatial gradients of their appearance. Our results show that TEAN achieves recognition performance that surpasses single view performance and standard (non-differential/large-angle sampling) multiview performance.

Submitted 21 September, 2020; originally announced September 2020.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). arXiv admin note: substantial text overlap with arXiv:1612.02372

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2020

arXiv:2008.07049 [pdf, other]

Video Region Annotation with Sparse Bounding Boxes

Authors: Yuzheng Xu, Yang Wu, Nur Sabrina binti Zuraimi, Shohei Nobuhara, Ko Nishino

Abstract: Video analysis has been moving towards more detailed interpretation (e.g. segmentation) with encouraging progress. These tasks, however, increasingly rely on densely annotated training data both in space and time. Since such annotation is labour-intensive, few densely annotated video datasets with detailed region boundaries exist. This work aims to resolve this dilemma by learning to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions. We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries using the spatio-temporal volume of surrounding appearance and motion. The global optimization of VGCN makes it significantly stronger and helps it generalize better than existing solutions. Experimental results using two recent datasets (one real and one synthetic), including ablation studies, demonstrate the effectiveness and superiority of our method.

Submitted 16 August, 2020; originally announced August 2020.

Comments: Accepted for publication in BMVC 2020 (Oral)

arXiv:2008.04030 [pdf, other]

Invertible Neural BRDF for Object Inverse Rendering

Authors: Zhe Chen, Shohei Nobuhara, Ko Nishino

Abstract: We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry. The BRDF is expressed with an invertible neural network, namely, normalizing flow, which provides the expressive power of a high-dimensional representation, computational simplicity of a compact analytical model, and physical plausibility of a real-world BRDF. We extract the latent space of real-world reflectance by conditioning this model, which directly results in a strong reflectance prior. We refer to this model as the invertible neural BRDF model (iBRDF). We also devise a deep illumination prior by leveraging the structural bias of deep neural networks. By integrating this novel BRDF model and reflectance and illumination priors in a MAP estimation formulation, we show that this joint estimation can be computed efficiently with stochastic gradient descent. We experimentally validate the accuracy of the invertible neural BRDF model on a large number of measured data and demonstrate its use in object inverse rendering on a number of synthetic and real images. The results show new ways in which deep neural networks can help solve challenging radiometric inverse problems.

Submitted 11 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: accepted to ECCV 2020 as spotlight

arXiv:1912.04663 [pdf, other]

3D-GMNet: Single-View 3D Shape Recovery as A Gaussian Mixture

Authors: Kohei Yamashita, Shohei Nobuhara, Ko Nishino

Abstract: In this paper, we introduce 3D-GMNet, a deep neural network for 3D object shape reconstruction from a single image. As the name suggests, 3D-GMNet recovers 3D shape as a Gaussian mixture. In contrast to voxels, point clouds, or meshes, a Gaussian mixture representation provides an analytical expression with a small memory footprint while accurately representing the target 3D shape. At the same time, it offers a number of additional advantages including instant pose estimation and controllable level-of-detail reconstruction, while also enabling interpretation as a point cloud, volume, and a mesh model. We train 3D-GMNet end-to-end with single input images and corresponding 3D models by introducing two novel loss functions, a 3D Gaussian mixture loss and a 2D multi-view loss, which collectively enable accurate shape reconstruction as kernel density estimation. We thoroughly evaluate the effectiveness of 3D-GMNet with synthetic and real images of objects. The results show accurate reconstruction with a compact representation that also realizes novel applications of single-image 3D reconstruction.

Submitted 15 August, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: BMVC 2020

arXiv:1906.10284 [pdf, other]

Appearance and Shape from Water Reflection

Authors: Ryo Kawahara, Meng-Yu Jennifer Kuo, Shohei Nobuhara, Ko Nishino

Abstract: This paper introduces single-image geometric and appearance reconstruction from water reflection photography, i.e., images capturing direct and water-reflected real-world scenes. Water reflection offers an additional viewpoint to the direct sight, collectively forming a stereo pair. The water-reflected scene, however, includes internally scattered and reflected environmental illumination in addition to the scene radiance, which precludes direct stereo matching. We derive a principled iterative method that disentangles this scene radiometry and geometry for reconstructing 3D scene structure as well as its high-dynamic range appearance. In the presence of waves, we simultaneously recover the wave geometry as surface normal perturbations of the water surface. Most important, we show that the water reflection enables calibration of the camera. In other words, for the first time, we show that capturing a direct and water-reflected scene in a single exposure forms a self-calibrating HDR catadioptric stereo camera. We demonstrate our method on a number of images taken in the wild. The results demonstrate a new means for leveraging this accidental catadioptric camera.

Submitted 7 January, 2020; v1 submitted 24 June, 2019; originally announced June 2019.

Comments: WACV 2020

arXiv:1811.03331 [pdf, other]

Improving Multi-Person Pose Estimation using Label Correction

Authors: Naoki Kato, Tianqi Li, Kohei Nishino, Yusuke Uchida

Abstract: Significant attention has recently been paid to multi-person pose estimation methods, as there has been rapid progress in the field owing to convolutional neural networks. In particular, a recent method that exploits part confidence maps and Part Affinity Fields (PAFs) has achieved accurate real-time prediction of multi-person keypoints. However, human annotated labels are sometimes inappropriate for learning models. For example, if there is a limb that extends outside an image, a keypoint for the limb may not have annotations because it exists outside of the image, and thus the labels for the limb cannot be generated. If a model is trained with data including such missing labels, the output of the model for the location, even though it is correct, is penalized as a false positive, which is likely to cause negative effects on the performance of the model. In this paper, we point out the existence of some patterns of inappropriate labels, and propose a novel method for correcting such labels with a teacher model trained on such incomplete data. Experiments on the COCO dataset show that training with the corrected labels improves the performance of the model and also speeds up training.

Submitted 8 November, 2018; originally announced November 2018.

arXiv:1801.03127 [pdf, other]

Recognizing Material Properties from Images

Authors: Gabriel Schwartz, Ko Nishino

Abstract: Humans rely on properties of the materials that make up objects to guide our interactions with them. Grasping smooth materials, for example, requires care, and softness is an ideal property for fabric used in bedding. Even when these properties are not visual (e.g. softness is a physical property), we may still infer their presence visually. We refer to such material properties as visual material attributes. Recognizing these attributes in images can contribute valuable information for general scene understanding and material recognition. Unlike well-known object and scene attributes, visual material attributes are local properties with no fixed shape or spatial extent. We show that given a set of images annotated with known material attributes, we may accurately recognize the attributes from small local image patches. Obtaining such annotations in a consistent fashion at scale, however, is challenging. To address this, we introduce a method that allows us to probe the human visual perception of materials by asking simple yes/no questions comparing pairs of image patches. This provides sufficient weak supervision to build a set of attributes and associated classifiers that, while unnamed, serve the same function as the named attributes we use to describe materials. Doing so allows us to recognize visual material attributes without resorting to exhaustive manual annotation of a fixed set of named attributes. Furthermore, we show that this method may be integrated in the end-to-end learning of a material classification CNN to simultaneously recognize materials and discover their visual attributes. Our experimental results show that visual material attributes, whether named or automatically discovered, provide a useful intermediate representation for known material categories themselves as well as a basis for transfer learning when recognizing previously-unseen categories.

Submitted 9 January, 2018; originally announced January 2018.

arXiv:1612.02372 [pdf, other]

Differential Angular Imaging for Material Recognition

Authors: Jia Xue, Hang Zhang, Kristin Dana, Ko Nishino

Abstract: Material recognition for real-world outdoor surfaces has become increasingly important for computer vision to support its operation "in the wild." Computational surface modeling that underlies material recognition has transitioned from reflectance modeling using in-lab controlled radiometric measurements to image-based representations based on internet-mined images of materials captured in the scene. We propose to take a middle-ground approach for material recognition that takes advantage of both rich radiometric cues and flexible image capture. We realize this by developing a framework for differential angular imaging, where small angular variations in image capture provide an enhanced appearance representation and significant recognition improvement. We build a large-scale material database, Ground Terrain in Outdoor Scenes (GTOS) database, geared towards real use for autonomous agents. The database consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions. We develop a novel approach for material recognition called a Differential Angular Imaging Network (DAIN) to fully leverage this large dataset. With this novel network architecture, we extract characteristics of materials encoded in the angular and spatial gradients of their appearance. Our results show that DAIN achieves recognition performance that surpasses single view or coarsely quantized multiview images. These results demonstrate the effectiveness of differential angular imaging as a means for flexible, in-place material recognition.

Submitted 13 July, 2017; v1 submitted 7 December, 2016; originally announced December 2016.

arXiv:1611.09394 [pdf, other]

Material Recognition from Local Appearance in Global Context

Authors: Gabriel Schwartz, Ko Nishino

Abstract: Recognition of materials has proven to be a challenging problem due to the wide variation in appearance within and between categories. Global image context, such as where the material is or what object it makes up, can be crucial to recognizing the material. Existing methods, however, operate on an implicit fusion of materials and context by using large receptive fields as input (i.e., large image patches). Many recent material recognition methods treat materials as yet another set of labels like objects. Materials are, however, fundamentally different from objects as they have no inherent shape or defined spatial extent. Approaches that ignore this can only take advantage of limited implicit context as it appears during training. We instead show that recognizing materials purely from their local appearance and integrating separately recognized global contextual cues including objects and places leads to superior dense, per-pixel, material recognition. We achieve this by training a fully-convolutional material recognition network end-to-end with only material category supervision. We integrate object and place estimates to this network from independent CNNs. This approach avoids the necessity of preparing an impractically-large amount of training data to cover the product space of materials, objects, and scenes, while fully leveraging contextual cues for dense material recognition. Furthermore, we perform a detailed analysis of the effects of context granularity, spatial resolution, and the network level at which we introduce context. On a recently introduced comprehensive and diverse material database \cite{Schwartz2016}, we confirm that our method achieves state-of-the-art accuracy with significantly less training data compared to past methods.

Submitted 12 April, 2017; v1 submitted 28 November, 2016; originally announced November 2016.

arXiv:1604.01354 [pdf, other]

Radiometric Scene Decomposition: Scene Reflectance, Illumination, and Geometry from RGB-D Images

Authors: Stephen Lombardi, Ko Nishino

Abstract: Recovering the radiometric properties of a scene (i.e., the reflectance, illumination, and geometry) is a long-sought ability of computer vision that can provide invaluable information for a wide range of applications. Deciphering the radiometric ingredients from the appearance of a real-world scene, as opposed to a single isolated object, is particularly challenging as it generally consists of various objects with different material compositions exhibiting complex reflectance and light interactions that are also part of the illumination. We introduce the first method for radiometric scene decomposition that handles those intricacies. We use RGB-D images to bootstrap geometry recovery and simultaneously recover the complex reflectance and natural illumination while refining the noisy initial geometry and segmenting the scene into different material regions. Most important, we handle real-world scenes consisting of multiple objects of unknown materials, which necessitates the modeling of spatially-varying complex reflectance, natural illumination, texture, interreflection and shadows. We systematically evaluate the effectiveness of our method on synthetic scenes and demonstrate its application to real-world scenes. The results show that rich radiometric information can be recovered from RGB-D images and demonstrate a new role RGB-D sensors can play for general scene understanding tasks.

Submitted 5 April, 2016; originally announced April 2016.

Comments: 16 pages

arXiv:1604.01345 [pdf, other]

Integrating Local Material Recognition with Large-Scale Perceptual Attribute Discovery

Authors: Gabriel Schwartz, Ko Nishino

Abstract: Material attributes have been shown to provide a discriminative intermediate representation for recognizing materials, especially for the challenging task of recognition from local material appearance (i.e., regardless of object and scene context). In the past, however, material attributes have been recognized separately preceding category recognition. In contrast, neuroscience studies on material perception and computer vision research on object and place recognition have shown that attributes are produced as a by-product during the category recognition process. Does the same hold true for material attribute and category recognition? In this paper, we introduce a novel material category recognition network architecture to show that perceptual attributes can, in fact, be automatically discovered inside a local material recognition framework. The novel material-attribute-category convolutional neural network (MAC-CNN) produces perceptual material attributes from the intermediate pooling layers of an end-to-end trained category recognition network using an auxiliary loss function that encodes human material perception. To train this model, we introduce a novel large-scale database of local material appearance organized under a canonical material category taxonomy and careful image patch extraction that avoids unwanted object and scene context. We show that the discovered attributes correspond well with semantically-meaningful visual material traits via Boolean algebra, and enable recognition of previously unseen material categories given only a few examples. These results have strong implications in how perceptually meaningful attributes can be learned in other recognition tasks.

Submitted 12 April, 2017; v1 submitted 5 April, 2016; originally announced April 2016.

arXiv:1603.07998 [pdf, other]

Friction from Reflectance: Deep Reflectance Codes for Predicting Physical Surface Properties from One-Shot In-Field Reflectance

Authors: Hang Zhang, Kristin Dana, Ko Nishino

Abstract: Images are the standard input for vision algorithms, but one-shot in-field reflectance measurements are creating new opportunities for recognition and scene understanding. In this work, we address the question of what reflectance can reveal about materials in an efficient manner. We go beyond the question of recognition and labeling and ask the question: What intrinsic physical properties of the surface can be estimated using reflectance? We introduce a framework that enables prediction of actual friction values for surfaces using one-shot reflectance measurements. This work is a first-of-its-kind vision-based friction estimation. We develop a novel representation for reflectance disks that capture partial BRDF measurements instantaneously. Our method of deep reflectance codes combines CNN features and Fisher vector pooling with optimal binary embedding to create codes that have sufficient discriminatory power and have important properties of illumination and spatial invariance. The experimental results demonstrate that reflectance can play a new role in deciphering the underlying physical properties of real-world scenes.

Submitted 10 July, 2016; v1 submitted 25 March, 2016; originally announced March 2016.

arXiv:1502.02092 [pdf, other]

Reflectance Hashing for Material Recognition

Authors: Hang Zhang, Kristin Dana, Ko Nishino

Abstract: We introduce a novel method for using reflectance to identify materials. Reflectance offers a unique signature of the material but is challenging to measure and use for recognizing materials due to its high-dimensionality. In this work, one-shot reflectance is captured using a unique optical camera measuring reflectance disks where the pixel coordinates correspond to surface viewing angles. The reflectance has class-specific structure and angular gradients computed in this reflectance space reveal the material class. These reflectance disks encode discriminative information for efficient and accurate material recognition. We introduce a framework called reflectance hashing that models the reflectance disks with dictionary learning and binary hashing. We demonstrate the effectiveness of reflectance hashing for material recognition with a number of real-world materials.

Submitted 6 February, 2015; originally announced February 2015.
