Search | arXiv e-print repository

GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

Authors: Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, Yuki Mitsufuji

Abstract: Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios. Project page is available at https://GenWarp-NVS.github.io/.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Project page: https://GenWarp-NVS.github.io

arXiv:2303.15780 [pdf, other]

Instruct 3D-to-3D: Text Instruction Guided 3D-to-3D conversion

Authors: Hiromichi Kamata, Yuiko Sakuma, Akio Hayakawa, Masato Ishii, Takuya Narihira

Abstract: We propose a high-quality 3D-to-3D conversion method, Instruct 3D-to-3D. Our method is designed for a novel task, which is to convert a given 3D scene to another scene according to text instructions. Instruct 3D-to-3D applies pretrained Image-to-Image diffusion models for 3D-to-3D conversion. This enables the likelihood maximization of each viewpoint image and high-quality 3D generation. In addition, our proposed method explicitly inputs the source 3D scene as a condition, which enhances 3D consistency and controllability of how much of the source 3D scene structure is reflected. We also propose dynamic scaling, which allows the intensity of the geometry transformation to be adjusted. We performed quantitative and qualitative evaluations and showed that our proposed method achieves higher quality 3D-to-3D conversions than baseline methods.

Submitted 28 March, 2023; originally announced March 2023.

Comments: Project page: https://sony.github.io/Instruct3Dto3D-doc/

arXiv:2303.13121 [pdf, other]

DetOFA: Efficient Training of Once-for-All Networks for Object Detection Using Path Filter

Authors: Yuiko Sakuma, Masato Ishii, Takuya Narihira

Abstract: We address the challenge of training a large supernet for the object detection task, using a relatively small amount of training data. Specifically, we propose an efficient supernet-based neural architecture search (NAS) method that uses search space pruning. The search space defined by the supernet is pruned by removing candidate models that are predicted to perform poorly. To effectively remove the candidates over a wide range of resource constraints, we particularly design a performance predictor for supernet, called path filter, which is conditioned by resource constraints and can accurately predict the relative performance of the models that satisfy similar resource constraints. Hence, supernet training is more focused on the best-performing candidates. Our path filter handles prediction for paths with different resource budgets. Compared to once-for-all, our proposed method reduces the computational cost of the optimal network architecture by 30% and 63%, while yielding better accuracy-floating point operations Pareto front (0.85 and 0.45 points of improvement on average precision for Pascal VOC and COCO, respectively).

Submitted 19 October, 2023; v1 submitted 23 March, 2023; originally announced March 2023.

Comments: Accepted to ICCV workshop 2023

arXiv:2302.00675 [pdf, other]

NDJIR: Neural Direct and Joint Inverse Rendering for Geometry, Lights, and Materials of Real Object

Authors: Kazuki Yoshiyama, Takuya Narihira

Abstract: The goal of inverse rendering is to decompose geometry, lights, and materials given posed multi-view images. To achieve this goal, we propose neural direct and joint inverse rendering, NDJIR. Different from prior works which rely on some approximations of the rendering equation, NDJIR directly addresses the integrals in the rendering equation and jointly decomposes geometry (signed distance function), lights (environment and implicit lights), and materials (base color, roughness, specular reflectance) using the powerful and flexible volume rendering framework, voxel grid features, and a Bayesian prior. Our method directly uses physically-based rendering, so we can seamlessly export an extracted mesh with materials to DCC tools and show material conversion examples. We perform intensive experiments to show that our proposed method can decompose semantically well for real objects in a photogrammetric setting and what factors contribute towards accurate inverse rendering.

Submitted 2 February, 2023; originally announced February 2023.

Comments: 26 pages

arXiv:2212.02024 [pdf, other]

Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models

Authors: Naoki Matsunaga, Masato Ishii, Akio Hayakawa, Kenji Suzuki, Takuya Narihira

Abstract: Our goal is to develop fine-grained real-image editing methods suitable for real-world applications. In this paper, we first summarize four requirements for these methods and propose a novel diffusion-based image editing framework with pixel-wise guidance that satisfies these requirements. Specifically, we train pixel-classifiers with a few annotated data and then infer the segmentation map of a target image. Users then manipulate the map to instruct how the image will be edited. We utilize a pre-trained diffusion model to generate edited images aligned with the user's intention with pixel-wise guidance. The effective combination of proposed guidance and other techniques enables highly controllable editing with preserving the outside of the edited area, which results in meeting our requirements. The experimental results demonstrate that our proposal outperforms the GAN-based method for editing quality and speed.

Submitted 31 May, 2023; v1 submitted 4 December, 2022; originally announced December 2022.

Comments: Accepted by AI for Content Creation (AI4CC) workshop at CVPR 2023

arXiv:2202.10758 [pdf, other]

Thinking the Fusion Strategy of Multi-reference Face Reenactment

Authors: Takuya Yashima, Takuya Narihira, Tamaki Kojima

Abstract: In recent advances of deep generative models, face reenactment (manipulating and controlling a human face, including head movement) has drawn much attention for its wide range of applicability. Despite its strong expressiveness, it is inevitable that the models fail to reconstruct or accurately generate the unseen side of the face from a single reference image. Most existing methods alleviate this problem by learning appearances of human faces from a large amount of data and generating realistic texture at inference time. Rather than completely relying on what generative models learn, we show that a simple extension using multiple reference images significantly improves generation quality. We show this by 1) conducting the reconstruction task on a publicly available dataset, 2) conducting facial motion transfer on our original dataset, which consists of multi-person head movement video sequences, and 3) using a newly proposed evaluation metric to validate that our method achieves better quantitative results.

Submitted 22 February, 2022; originally announced February 2022.

Comments: Submitted to ICIP2022, 5 pages, 3 figures, 3 tables

arXiv:2103.11807 [pdf, other]

Data Cleansing for Deep Neural Networks with Storage-efficient Approximation of Influence Functions

Authors: Kenji Suzuki, Yoshiyuki Kobayashi, Takuya Narihira

Abstract: Identifying the influence of training data for data cleansing can improve the accuracy of deep learning. An approach with stochastic gradient descent (SGD) called SGD-influence to calculate the influence scores was proposed, but the calculation costs are expensive. It is necessary to temporarily store the parameters of the model during training phase for inference phase to calculate influence scores. In close connection with the previous method, we propose a method to reduce cache files to store the parameters in training phase for calculating inference score. We only adopt the final parameters in last epoch for influence functions calculation. In our experiments on classification, the cache size of training using MNIST dataset with our approach is 1.236 MB. On the other hand, the previous method used cache size of 1.932 GB in last epoch. It means that cache size has been reduced to 1/1,563. We also observed the accuracy improvement by data cleansing with removal of negatively influential data using our approach as well as the previous method. Moreover, our simple and general proposed method to calculate influence scores is available on our auto ML tool without programming, Neural Network Console. The source code is also available.

Submitted 1 June, 2021; v1 submitted 22 March, 2021; originally announced March 2021.
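
The 1/1,563 reduction quoted in the abstract above can be checked from the two cache sizes it reports. The snippet below is only a quick arithmetic check, assuming decimal units (1 GB = 1000 MB); the variable names are illustrative and do not come from the paper or its source code.

```python
# Cache sizes quoted in the abstract.
# Assumption: decimal units, i.e. 1 GB = 1000 MB.
previous_cache_mb = 1.932 * 1000  # previous method, last epoch (1.932 GB)
proposed_cache_mb = 1.236         # proposed method (1.236 MB)

# How many times smaller the proposed cache is.
reduction_factor = previous_cache_mb / proposed_cache_mb

print(f"cache reduced to about 1/{reduction_factor:,.0f} of the previous size")
```

Under the decimal-unit assumption the ratio rounds to the figure stated in the abstract; with binary units (1 GB = 1024 MB) it would not.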

arXiv:2103.04037 [pdf, other]

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Authors: Andrew Shin, Masato Ishii, Takuya Narihira

Abstract: Transformer architectures have brought about fundamental changes to computational linguistic field, which had been dominated by recurrent neural networks for many years. Its success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends on how transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

Submitted 9 November, 2021; v1 submitted 6 March, 2021; originally announced March 2021.

Comments: Accepted for publication by International Journal of Computer Vision (IJCV)

arXiv:2102.06725 [pdf, other]

Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives

Authors: Takuya Narihira, Javier Alonsogarcia, Fabien Cardinaux, Akio Hayakawa, Masato Ishii, Kazunori Iwaki, Thomas Kemp, Yoshiyuki Kobayashi, Lukas Mauch, Akira Nakamura, Yukio Obuchi, Andrew Shin, Kenji Suzuki, Stephen Tiedmann, Stefan Uhlich, Takuya Yashima, Kazuki Yoshiyama

Abstract: While there exist a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation on distributed setting, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineer's perspective, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.

Submitted 21 June, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: https://nnabla.org

arXiv:2011.12528 [pdf, other]

Reference-Based Video Colorization with Spatiotemporal Correspondence

Authors: Naofumi Akimoto, Akio Hayakawa, Andrew Shin, Takuya Narihira

Abstract: We propose a novel reference-based video colorization framework with spatiotemporal correspondence. Reference-based methods colorize grayscale frames referencing a user input color frame. Existing methods suffer from the color leakage between objects and the emergence of average colors, derived from non-local semantic correspondence in space. To address this issue, we warp colors only from the regions on the reference frame restricted by correspondence in time. We propagate masks as temporal correspondences, using two complementary tracking approaches: off-the-shelf instance tracking for high performance segmentation, and newly proposed dense tracking to track various types of objects. By restricting temporally-related regions for referencing colors, our approach propagates faithful colors throughout the video. Experiments demonstrate that our method outperforms state-of-the-art methods quantitatively and qualitatively.

Submitted 25 November, 2020; originally announced November 2020.

arXiv:2010.14109 [pdf, other]

Out-of-core Training for Extremely Large-Scale Neural Networks With Adaptive Window-Based Scheduling

Authors: Akio Hayakawa, Takuya Narihira

Abstract: While large neural networks demonstrate higher performance in various tasks, training large networks is difficult due to limitations on GPU memory size. We propose a novel out-of-core algorithm that enables faster training of extremely large-scale neural networks with sizes larger than allotted GPU memory. Under a given memory budget constraint, our scheduling algorithm locally adapts the timing of memory transfers according to memory usage of each function, which improves overlap between computation and memory transfers. Additionally, we apply virtual addressing technique, commonly performed in OS, to training of neural networks with out-of-core execution, which drastically reduces the amount of memory fragmentation caused by frequent memory transfers. With our proposed algorithm, we successfully train ResNet-50 with 1440 batch-size with keeping training speed at 55%, which is 7.5x larger than the upper bound of physical memory. It also outperforms a previous state-of-the-art substantially, i.e. it trains a 1.55x larger network than state-of-the-art with faster execution. Moreover, we experimentally show that our approach is also scalable for various types of networks.

Submitted 27 October, 2020; originally announced October 2020.

arXiv:1908.03343 [pdf, ps, other]

Fully Convolutional Search Heuristic Learning for Rapid Path Planners

Authors: Yuka Ariki, Takuya Narihira

Abstract: Path-planning algorithms are an important part of a wide variety of robotic applications, such as mobile robot navigation and robot arm manipulation. However, in large search spaces in which local traps may exist, it remains challenging to reliably find a path while satisfying real-time constraints. Efforts to speed up the path search have led to the development of many practical path-planning algorithms. These algorithms often define a search heuristic to guide the search towards the goal. The heuristics should be carefully designed for each specific problem to ensure reliability in the various situations encountered in the problem. However, it is often difficult for humans to craft such robust heuristics, and the search performance often degrades under conditions that violate the heuristic assumption. Rather than manually designing the heuristics, in this work, we propose a learning approach to acquire these search heuristics. Our method represents the environment containing the obstacles as an image, and this image is fed into fully convolutional neural networks to produce a search heuristic image where every pixel represents a heuristic value (cost-to-go value to a goal) in the form of a vertex of a search graph. Training the heuristic is performed using previously collected planning results. Our preliminary experiments (2D grid world navigation experiments) demonstrate significant reduction in the search costs relative to a hand-designed heuristic.

Submitted 9 August, 2019; originally announced August 2019.

Comments: 11 pages, 4 figures
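
To make the idea of a per-pixel heuristic image concrete, here is a minimal sketch of grid A* that looks up its heuristic from a 2D array instead of computing it from a formula. This is not the paper's implementation: the function name `astar`, the 4-connected unit-cost grid, and the hand-filled Manhattan-distance array standing in for a network's output are all assumptions made for illustration.

```python
import heapq

def astar(grid, heuristic, start, goal):
    """A* on a 4-connected grid with unit step cost.

    grid[r][c] == 1 marks an obstacle.
    heuristic[r][c] is a cost-to-go estimate for that cell; in the setting
    described above it would be predicted by a network (hypothetical here).
    Returns (path, number_of_expanded_nodes); path is None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    open_heap = [(heuristic[start[0]][start[1]], 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    expanded = 0
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if cost > best_cost[node]:
            continue  # stale heap entry, a cheaper route was found later
        expanded += 1
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], expanded
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    parent[(nr, nc)] = node
                    priority = new_cost + heuristic[nr][nc]
                    heapq.heappush(open_heap, (priority, new_cost, (nr, nc)))
    return None, expanded

# 3x3 world with one obstacle in the centre.
grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
start, goal = (0, 0), (2, 2)

# Stand-in for a learned heuristic image: Manhattan distance to the goal.
heuristic = [[abs(goal[0] - r) + abs(goal[1] - c) for c in range(3)]
             for r in range(3)]

path, expanded = astar(grid, heuristic, start, goal)
print(path, expanded)
```

The `expanded` counter is the kind of search cost a better heuristic would be expected to reduce; swapping the `heuristic` array for a different one changes the expansion order without touching the search code.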

arXiv:1512.02767 [pdf, other]

Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding

Authors: Michael Maire, Takuya Narihira, Stella X. Yu

Abstract: Spectral embedding provides a framework for solving perceptual organization problems, including image segmentation and figure/ground organization. From an affinity matrix describing pairwise relationships between pixels, it clusters pixels into regions, and, using a complex-valued extension, orders pixels according to layer. We train a convolutional neural network (CNN) to directly predict the pairwise relationships that define this affinity matrix. Spectral embedding then resolves these predictions into a globally-consistent segmentation and figure/ground organization of the scene. Experiments demonstrate significant benefit to this direct coupling compared to prior works which use explicit intermediate stages, such as edge detection, on the pathway from image to affinities. Our results suggest spectral embedding as a powerful alternative to the conditional random field (CRF)-based globalization schemes typically coupled to deep neural networks.

Submitted 11 April, 2016; v1 submitted 9 December, 2015; originally announced December 2015.

Comments: minor updates; extended version of CVPR 2016 conference paper

arXiv:1512.02311 [pdf, other]

Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression

Authors: Takuya Narihira, Michael Maire, Stella X. Yu

Abstract: We introduce a new approach to intrinsic image decomposition, the task of decomposing a single image into albedo and shading components. Our strategy, which we term direct intrinsics, is to learn a convolutional neural network (CNN) that directly predicts output albedo and shading channels from an input RGB image patch. Direct intrinsics is a departure from classical techniques for intrinsic image decomposition, which typically rely on physically-motivated priors and graph-based inference algorithms. The large-scale synthetic ground-truth of the MPI Sintel dataset plays a key role in training direct intrinsics. We demonstrate results on both the synthetic images of Sintel and the real images of the classic MIT intrinsic image dataset. On Sintel, direct intrinsics, using only RGB input, outperforms all prior work, including methods that rely on RGB+Depth input. Direct intrinsics also generalizes across modalities; it produces quite reasonable decompositions on the real images of the MIT dataset. Our results indicate that the marriage of CNNs with synthetic training data may be a powerful new technique for tackling classic problems in computer vision.

Submitted 7 December, 2015; originally announced December 2015.

Comments: International Conference on Computer Vision (ICCV), 2015

arXiv:1511.06838 [pdf, other]

Mapping Images to Sentiment Adjective Noun Pairs with Factorized Neural Nets

Authors: Takuya Narihira, Damian Borth, Stella X. Yu, Karl Ni, Trevor Darrell

Abstract: We consider the visual sentiment task of mapping an image to an adjective noun pair (ANP) such as "cute baby". To capture the two-factor structure of our ANP semantics as well as to overcome annotation noise and ambiguity, we propose a novel factorized CNN model which learns separate representations for adjectives and nouns but optimizes the classification performance over their product. Our experiments on the publicly available SentiBank dataset show that our model significantly outperforms not only independent ANP classifiers on unseen ANPs and on retrieving images of novel ANPs, but also image captioning models which capture word semantics from co-occurrence of natural text; the latter turn out to be surprisingly poor at capturing the sentiment evoked by pure visual experience. That is, our factorized ANP CNN not only trains better from noisy labels and generalizes better to new images, but can also expand the ANP vocabulary on its own.

Submitted 20 November, 2015; originally announced November 2015.
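
One simplified way to read "separate representations for adjectives and nouns" combined "over their product" is to score each adjective-noun pair as the product of an adjective probability and a noun probability. The sketch below illustrates only that reading; it is not the paper's architecture, and the logits, vocabularies, and the `anp_scores` helper are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def anp_scores(adj_logits, noun_logits, adjectives, nouns):
    """Score every adjective-noun pair as p(adjective) * p(noun).

    Assumption: the two factors are combined by a plain outer product of
    softmax outputs; the actual model may combine them differently.
    """
    p_adj = softmax(adj_logits)
    p_noun = softmax(noun_logits)
    return {
        f"{a} {n}": pa * pn
        for a, pa in zip(adjectives, p_adj)
        for n, pn in zip(nouns, p_noun)
    }

# Hypothetical logits for one image from two separate heads.
adjectives = ["cute", "scary"]
nouns = ["dog", "baby"]
scores = anp_scores([2.0, 0.5], [0.1, 1.5], adjectives, nouns)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because the pair scores come from two small vocabularies, every combination receives a score, including pairs never observed together, which is in the spirit of the abstract's remark about handling unseen ANPs.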

Showing 1–15 of 15 results for author: Narihira, T