Search | arXiv e-print repository

Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People

Authors: Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima

Abstract: Visual Language Navigation (VLN) powered navigation robots have the potential to guide blind people by understanding and executing route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes… ▽ More Visual Language Navigation (VLN) powered navigation robots have the potential to guide blind people by understanding and executing route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omission of details as opposed to those obtained by thinking out loud, such as in the Room-to-Room dataset. However, currently, there is no benchmark that simulates instructions that were obtained from human memory in environments where blind people navigate. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. To collect natural language instructions, we conducted two studies from sighted passersby onsite and annotators online. Our analysis demonstrates that instructions data collected onsite were more lengthy and contained more varied wording. Alongside our benchmark, we propose a VLN model better equipped to handle the scenario. Our proposed VLN model uses Large Language Models (LLM) to parse instructions and generate Python codes for robot control. We further show that the existing state-of-the-art model performed suboptimally on our benchmark. In contrast, our proposed method outperformed the state-of-the-art model by a fair margin. We found that future research should exercise caution when considering VLN technology for practical applications, as real-world scenarios have different characteristics than ones collected in traditional settings. △ Less

Submitted 11 May, 2024; originally announced May 2024.

arXiv:2310.00355 [pdf, other]

doi 10.1145/3610661.3616177

Gaze-Driven Sentence Simplification for Language Learners: Enhancing Comprehension and Readability

Authors: Taichi Higasa, Keitaro Tanaka, Qi Feng, Shigeo Morishima

Abstract: Language learners should regularly engage in reading challenging materials as part of their study routine. Nevertheless, constantly referring to dictionaries is time-consuming and distracting. This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension while maintaining their focus on the content. Our system incorporates machine learning models… ▽ More Language learners should regularly engage in reading challenging materials as part of their study routine. Nevertheless, constantly referring to dictionaries is time-consuming and distracting. This paper presents a novel gaze-driven sentence simplification system designed to enhance reading comprehension while maintaining their focus on the content. Our system incorporates machine learning models tailored to individual learners, combining eye gaze features and linguistic features to assess sentence comprehension. When the system identifies comprehension difficulties, it provides simplified versions by replacing complex vocabulary and grammar with simpler alternatives via GPT-3.5. We conducted an experiment with 19 English learners, collecting data on their eye movements while reading English text. The results demonstrated that our system is capable of accurately estimating sentence-level comprehension. Additionally, we found that GPT-3.5 simplification improved readability in terms of traditional readability metrics and individual word difficulty, paraphrasing across different linguistic levels. △ Less

Submitted 30 September, 2023; originally announced October 2023.

Comments: Accepted by ACM ICMI 2023 workshops (Multimodal, Interactive Interfaces for Education)

arXiv:2309.10375 [pdf, other]

Pointing out Human Answer Mistakes in a Goal-Oriented Visual Dialogue

Authors: Ryosuke Oshima, Seitaro Shinagawa, Hideki Tsunashima, Qi Feng, Shigeo Morishima

Abstract: Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we… ▽ More Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we focus on a setting where the agent points out unintentional mistakes for the interlocutor to review, better reflecting real-world situations. In this paper, we show that human answer mistakes depend on question type and QA turn in the visual dialogue by analyzing a previously unused data collection of human mistakes. We demonstrate the effectiveness of those factors for the model's accuracy in a pointing-human-mistake task through experiments using a simple MLP model and a Visual Language Model. △ Less

Submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted at ICCVW 2023

arXiv:2308.13042 [pdf, other]

Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation

Authors: Qi Feng, Hubert P. H. Shum, Shigeo Morishima

Abstract: Pre-captured immersive environments using omnidirectional cameras provide a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention due to the difficulty of… ▽ More Pre-captured immersive environments using omnidirectional cameras provide a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention due to the difficulty of altering the perspective after finishing the capture process. To explore this influence, we first propose a pilot study that captures real environments with multiple eye heights and asks participants to judge the egocentric distances and immersion. If a significant influence is confirmed, an effective image-based approach to adapt pre-captured real-world environments to the user's eye height would be desirable. Motivated by the study, we propose a learning-based approach for synthesizing novel views for omnidirectional images with altered eye heights. This approach employs a multitask architecture that learns depth and semantic segmentation in two formats, and generates high-quality depth and semantic segmentation to facilitate the inpainting stage. With the improved omnidirectional-aware layered depth image, our approach synthesizes natural and realistic visuals for eye height adaptation. Quantitative and qualitative evaluation shows favorable results against state-of-the-art methods, and an extensive user study verifies improved perception and immersion for pre-captured real-world environments. △ Less

Submitted 24 August, 2023; originally announced August 2023.

Comments: 10 pages, 13 figures, 3 tables, submitted to ISMAR 2023

arXiv:2306.06495 [pdf, other]

Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction

Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima

Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates from noisy input audio a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and of a selected speaker not appearing in the video (off-screen target speech). Although conventional AV-SE methods have suppressed all off-screen sounds, it is necessary to listen to a specific pr… ▽ More This paper describes an audio-visual speech enhancement (AV-SE) method that estimates from noisy input audio a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and of a selected speaker not appearing in the video (off-screen target speech). Although conventional AV-SE methods have suppressed all off-screen sounds, it is necessary to listen to a specific pre-known speaker's speech (e.g., family member's voice and announcements in stations) in future applications of AV-SE (e.g., hearing aids), even when users' sight does not capture the speaker. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen one from a pre-recorded speech of the speaker. Two clues from different domains are integrated as an audio-visual clue, and the proposed model directly estimates the target mixture. To improve the estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline method that uses the state-of-the-art AV-SE and speaker extraction methods individually in terms of estimation accuracy and computational efficiency. △ Less

Submitted 10 June, 2023; originally announced June 2023.

Comments: Accepted by EUSIPCO 2023

arXiv:2305.14203 [pdf, other]

doi 10.21437/Interspeech.2023-370

Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima

Abstract: This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to… ▽ More This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes. Specifically, we aim to map the input of two speech types close to each other in a latent space if they have similar viseme representations. By minimizing the Kullback-Leibler divergence of the predicted viseme probability distributions between and within the two speech types, our model effectively learns and predicts viseme identities. Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available. △ Less

Submitted 16 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted by INTERSPEECH 2023

arXiv:2304.07087 [pdf, other]

Memory Efficient Diffusion Probabilistic Models via Patch-based Generation

Authors: Shinei Arakawa, Hideki Tsunashima, Daichi Horita, Keitaro Tanaka, Shigeo Morishima

Abstract: Diffusion probabilistic models have been successful in generating high-quality and diverse images. However, traditional models, whose input and output are high-resolution images, suffer from excessive memory requirements, making them less practical for edge devices. Previous approaches for generative adversarial networks proposed a patch-based method that uses positional encoding and global conten… ▽ More Diffusion probabilistic models have been successful in generating high-quality and diverse images. However, traditional models, whose input and output are high-resolution images, suffer from excessive memory requirements, making them less practical for edge devices. Previous approaches for generative adversarial networks proposed a patch-based method that uses positional encoding and global content information. Nevertheless, designing a patch-based approach for diffusion probabilistic models is non-trivial. In this paper, we resent a diffusion probabilistic model that generates images on a patch-by-patch basis. We propose two conditioning methods for a patch-based generation. First, we propose position-wise conditioning using one-hot representation to ensure patches are in proper positions. Second, we propose Global Content Conditioning (GCC) to ensure patches have coherent content when concatenated together. We evaluate our model qualitatively and quantitatively on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off between maximum memory consumption and generated image quality. Specifically, when an entire image is divided into 2 x 2 patches, our proposed approach can reduce the maximum memory consumption by half while maintaining comparable image quality. △ Less

Submitted 14 April, 2023; originally announced April 2023.

Comments: Accepted to the Generative Models for Computer Vision workshop at CVPR 2023

arXiv:2303.02930 [pdf, other]

Scapegoat Generation for Privacy Protection from Deepfake

Authors: Gido Kato, Yoshihiro Fukuhara, Mariko Isogawa, Hideki Tsunashima, Hirokatsu Kataoka, Shigeo Morishima

Abstract: To protect privacy and prevent malicious use of deepfake, current studies propose methods that interfere with the generation process, such as detection and destruction approaches. However, these methods suffer from sub-optimal generalization performance to unseen models and add undesirable noise to the original image. To address these problems, we propose a new problem formulation for deepfake pre… ▽ More To protect privacy and prevent malicious use of deepfake, current studies propose methods that interfere with the generation process, such as detection and destruction approaches. However, these methods suffer from sub-optimal generalization performance to unseen models and add undesirable noise to the original image. To address these problems, we propose a new problem formulation for deepfake prevention: generating a ``scapegoat image'' by modifying the style of the original input in a way that is recognizable as an avatar by the user, but impossible to reconstruct the real face. Even in the case of malicious deepfake, the privacy of the users is still protected. To achieve this, we introduce an optimization-based editing method that utilizes GAN inversion to discourage deepfake models from generating similar scapegoats. We validate the effectiveness of our proposed method through quantitative and user studies. △ Less

Submitted 6 March, 2023; originally announced March 2023.

Comments: 5 pages, 5 figures

MSC Class: 68T07

arXiv:2303.02608 [pdf, other]

doi 10.1109/icip49359.2023.10222771

Event-based Camera Simulation using Monte Carlo Path Tracing with Adaptive Denoising

Authors: Yuta Tsuji, Tatsuya Yatagawa, Hiroyuki Kubo, Shigeo Morishima

Abstract: This paper presents an algorithm to obtain an event-based video from noisy frames given by physics-based Monte Carlo path tracing over a synthetic 3D scene. Given the nature of dynamic vision sensor (DVS), rendering event-based video can be viewed as a process of detecting the changes from noisy brightness values. We extend a denoising method based on a weighted local regression (WLR) to detect th… ▽ More This paper presents an algorithm to obtain an event-based video from noisy frames given by physics-based Monte Carlo path tracing over a synthetic 3D scene. Given the nature of dynamic vision sensor (DVS), rendering event-based video can be viewed as a process of detecting the changes from noisy brightness values. We extend a denoising method based on a weighted local regression (WLR) to detect the brightness changes rather than applying denoising to every pixel. Specifically, we derive a threshold to determine the likelihood of event occurrence and reduce the number of times to perform the regression. Our method is robust to noisy video frames obtained from a few path-traced samples. Despite its efficiency, our method performs comparably to or even better than an approach that exhaustively denoises every frame. △ Less

Submitted 22 August, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

Comments: 8 pages, 6 figures, 3 tables

Journal ref: Proceedings of the IEEE International Conference on Image Processing (ICCP) 2023

arXiv:2301.06816 [pdf, other]

A Combined Finite Element and Finite Volume Method for Liquid Simulation

Authors: Tatsuya Koike, Shigeo Morishima, Ryoichi Ando

Abstract: We introduce a new Eulerian simulation framework for liquid animation that leverages both finite element and finite volume methods. In contrast to previous methods where the whole simulation domain is discretized either using the finite volume method or finite element method, our method spatially merges them together using two types of discretization being tightly coupled on its seams while enforc… ▽ More We introduce a new Eulerian simulation framework for liquid animation that leverages both finite element and finite volume methods. In contrast to previous methods where the whole simulation domain is discretized either using the finite volume method or finite element method, our method spatially merges them together using two types of discretization being tightly coupled on its seams while enforcing second order accurate boundary conditions at free surfaces. We achieve our formulation via a variational form using new shape functions specifically designed for this purpose. By enabling a mixture of the two methods, we can take advantage of the best of two worlds. For example, finite volume method (FVM) result in sparse linear systems; however, complexity is encountered when unstructured grids such as tetrahedral or Voronoi elements are used. Finite element method (FEM), on the other hand, result in comparably denser linear systems, but the complexity remains the same even if unstructured elements are chosen; thereby facilitating spatial adaptivity. In this paper, we propose to use FVM for the majority parts to retain the sparsity of linear systems and FEM for parts where the grid elements are allowed to be freely deformed. An example of this application is locally moving grids. We show that by adapting the local grid movement to an underlying nearly rigid motion, numerical diffusion is noticeably reduced; leading to better preservation of structural details such as sharp edges, thin sheets and spindles of liquids. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: 13 pages, 8 figures

arXiv:2207.09425 [pdf, other]

Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos

Authors: Tanqiu Qiao, Qianhui Men, Frederick W. B. Li, Yoshiki Kubotani, Shigeo Morishima, Hubert P. H. Shum

Abstract: Human-Object Interaction (HOI) recognition in videos is important for analyzing human activity. Most existing work focusing on visual features usually suffer from occlusion in the real-world scenarios. Such a problem will be further complicated when multiple people and objects are involved in HOIs. Consider that geometric features such as human pose and object position provide meaningful informati… ▽ More Human-Object Interaction (HOI) recognition in videos is important for analyzing human activity. Most existing work focusing on visual features usually suffer from occlusion in the real-world scenarios. Such a problem will be further complicated when multiple people and objects are involved in HOIs. Consider that geometric features such as human pose and object position provide meaningful information to understand HOIs, we argue to combine the benefits of both visual and geometric features in HOI recognition, and propose a novel Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN). The geometric-level graph models the interdependency between geometric features of humans and objects, while the fusion-level graph further fuses them with visual features of humans and objects. To demonstrate the novelty and effectiveness of our method in challenging scenarios, we propose a new multi-person HOI dataset (MPHOI-72). Extensive experiments on MPHOI-72 (multi-person HOI), CAD-120 (single-human HOI) and Bimanual Actions (two-hand HOI) datasets demonstrate our superior performance compared to state-of-the-arts. △ Less

Submitted 19 July, 2022; originally announced July 2022.

Comments: Accepted by ECCV 2022

arXiv:2203.15991 [pdf, other]

The Sound of Bounding-Boxes

Authors: Takashi Oya, Shohei Iwase, Shigeo Morishima

Abstract: In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these exi… ▽ More In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these existing methods, it is required to predetermine all the possible categories of objects that can produce sound and use an object detector applicable to all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound source simultaneously. As our method does not rely on any pre-trained detector, our method is applicable to arbitrary categories without any additional annotation. Furthermore, although being fully unsupervised, we found that our method performs comparably in separation accuracy. △ Less

Submitted 29 March, 2022; originally announced March 2022.

Comments: 6 pages, 5 figures, ICPR (International Conference on Pattern Recognition) 2022

arXiv:2203.09109 [pdf, other]

Community-Driven Comprehensive Scientific Paper Summarization: Insight from cvpaper.challenge

Authors: Shintaro Yamamoto, Hirokatsu Kataoka, Ryota Suzuki, Seitaro Shinagawa, Shigeo Morishima

Abstract: The present paper introduces a group activity involving writing summaries of conference proceedings by volunteer participants. The rapid increase in scientific papers is a heavy burden for researchers, especially non-native speakers, who need to survey scientific literature. To alleviate this problem, we organized a group of non-native English speakers to write summaries of papers presented at a c… ▽ More The present paper introduces a group activity involving writing summaries of conference proceedings by volunteer participants. The rapid increase in scientific papers is a heavy burden for researchers, especially non-native speakers, who need to survey scientific literature. To alleviate this problem, we organized a group of non-native English speakers to write summaries of papers presented at a computer vision conference to share the knowledge of the papers read by the group. We summarized a total of 2,000 papers presented at the Conference on Computer Vision and Pattern Recognition, a top-tier conference on computer vision, in 2019 and 2020. We quantitatively analyzed participants' selection regarding which papers they read among the many available papers. The experimental results suggest that we can summarize a wide range of papers without asking participants to read papers unrelated to their interests. △ Less

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2202.08010 [pdf, other]

360 Depth Estimation in the Wild -- The Depth360 Dataset and the SegFuse Network

Authors: Qi Feng, Hubert P. H. Shum, Shigeo Morishima

Abstract: Single-view depth estimation from omnidirectional images has gained popularity with its wide range of applications such as autonomous driving and scene reconstruction. Although data-driven learning-based methods demonstrate significant potential in this field, scarce training data and ineffective 360 estimation algorithms are still two key limitations hindering accurate estimation across diverse d… ▽ More Single-view depth estimation from omnidirectional images has gained popularity with its wide range of applications such as autonomous driving and scene reconstruction. Although data-driven learning-based methods demonstrate significant potential in this field, scarce training data and ineffective 360 estimation algorithms are still two key limitations hindering accurate estimation across diverse domains. In this work, we first establish a large-scale dataset with varied settings called Depth360 to tackle the training data problem. This is achieved by exploring the use of a plenteous source of data, 360 videos from the internet, using a test-time training method that leverages unique information in each omnidirectional sequence. With novel geometric and temporal constraints, our method generates consistent and convincing depth samples to facilitate single-view estimation. We then propose an end-to-end two-branch multi-task learning network, SegFuse, that mimics the human eye to effectively learn from the dataset and estimate high-quality depth maps from diverse monocular RGB images. With a peripheral branch that uses equirectangular projection for depth estimation and a foveal branch that uses cubemap projection for semantic segmentation, our method predicts consistent global depth while maintaining sharp details at local regions. Experimental results show favorable performance against the state-of-the-art methods. △ Less

Submitted 16 February, 2022; originally announced February 2022.

Comments: 10 pages, 10 figures, 5 tables, submitted to IEEE VR 2022

ACM Class: I.2.10

arXiv:2108.00268 [pdf, other]

RLTutor: Reinforcement Learning Based Adaptive Tutoring System by Modeling Virtual Student with Fewer Interactions

Authors: Yoshiki Kubotani, Yoshihiro Fukuhara, Shigeo Morishima

Abstract: A major challenge in the field of education is providing review schedules that present learned items at appropriate intervals to each student so that memory is retained over time. In recent years, attempts have been made to formulate item reviews as sequential decision-making problems to realize adaptive instruction based on the knowledge state of students. It has been reported previously that rei… ▽ More A major challenge in the field of education is providing review schedules that present learned items at appropriate intervals to each student so that memory is retained over time. In recent years, attempts have been made to formulate item reviews as sequential decision-making problems to realize adaptive instruction based on the knowledge state of students. It has been reported previously that reinforcement learning can help realize mathematical models of students learning strategies to maintain a high memory rate. However, optimization using reinforcement learning requires a large number of interactions, and thus it cannot be applied directly to actual students. In this study, we propose a framework for optimizing teaching strategies by constructing a virtual model of the student while minimizing the interaction with the actual teaching target. In addition, we conducted an experiment considering actual instructions using the mathematical model and confirmed that the model performance is comparable to that of conventional teaching methods. Our framework can directly substitute mathematical models used in experiments with human students, and our results can serve as a buffer between theoretical instructional optimization and practical applications in e-learning systems. △ Less

Submitted 31 July, 2021; originally announced August 2021.

Comments: Accepted in AI4EDU workshop at IJCAI2021. The official code is available on https://github.com/YoshikiKubotani/rltutor

arXiv:2102.00845 [pdf, other]

LSTM-SAKT: LSTM-Encoded SAKT-like Transformer for Knowledge Tracing

Authors: Takashi Oya, Shigeo Morishima

Abstract: This paper introduces the 2nd place solution for the Riiid! Answer Correctness Prediction in Kaggle, the world's largest data science competition website. This competition was held from October 16, 2020, to January 7, 2021, with 3395 teams and 4387 competitors. The main insights and contributions of this paper are as follows. (i) We pointed out existing Transformer-based models are suffering from… ▽ More This paper introduces the 2nd place solution for the Riiid! Answer Correctness Prediction in Kaggle, the world's largest data science competition website. This competition was held from October 16, 2020, to January 7, 2021, with 3395 teams and 4387 competitors. The main insights and contributions of this paper are as follows. (i) We pointed out existing Transformer-based models are suffering from a problem that the information which their query/key/value can contain is limited. To solve this problem, we proposed a method that uses LSTM to obtain query/key/value and verified its effectiveness. (ii) We pointed out 'inter-container' leakage problem, which happens in datasets where questions are sometimes served together. To solve this problem, we showed special indexing/masking techniques that are useful when using RNN-variants and Transformer. (iii) We found additional hand-crafted features are effective to overcome the limits of Transformer, which can never consider the samples older than the sequence length. △ Less

Submitted 10 February, 2021; v1 submitted 28 January, 2021; originally announced February 2021.

Comments: 4 pages, 3 figures, the paper at AAAI 2021 Workshop on AI Education https://sites.google.com/view/tipce-2021/home

arXiv:2012.11213 [pdf, ps, other]

Self-Supervised Learning for Visual Summary Identification in Scientific Publications

Authors: Shintaro Yamamoto, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš, Shigeo Morishima

Abstract: Providing visual summaries of scientific publications can increase information access for readers and thereby help deal with the exponential growth in the number of scientific publications. Nonetheless, efforts in providing visual publication summaries have been few and far apart, primarily focusing on the biomedical domain. This is primarily because of the limited availability of annotated gold s… ▽ More Providing visual summaries of scientific publications can increase information access for readers and thereby help deal with the exponential growth in the number of scientific publications. Nonetheless, efforts in providing visual publication summaries have been few and far apart, primarily focusing on the biomedical domain. This is primarily because of the limited availability of annotated gold standards, which hampers the application of robust and high-performing supervised learning techniques. To address these problems we create a new benchmark dataset for selecting figures to serve as visual summaries of publications based on their abstracts, covering several domains in computer science. Moreover, we develop a self-supervised learning approach, based on heuristic matching of inline references to figures with figure captions. Experiments in both biomedical and computer science domains show that our model is able to outperform the state of the art despite being self-supervised and therefore not relying on any annotated training data. △ Less

Submitted 14 January, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2007.05722 [pdf, other]

Do We Need Sound for Sound Source Localization?

Authors: Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo Yamaguchi, Shigeo Morishima

Abstract: During the performance of sound source localization which uses both visual and aural information, it presently remains unclear how much either image or sound modalities contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into… ▽ More During the performance of sound source localization which uses both visual and aural information, it presently remains unclear how much either image or sound modalities contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two steps: (i) "potential sound source localization", a step that localizes possible sound sources using only visual information (ii) "object selection", a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and verify this property through several experiments. △ Less

Submitted 11 July, 2020; originally announced July 2020.

Comments: Paper: 14 pages, 6 figures. Supplementary Material: 6 pages, 3 figures. Videos and Codes will be released later

arXiv:2004.03811 [pdf, other]

MirrorNet: A Deep Bayesian Approach to Reflective 2D Pose Estimation from Human Images

Authors: Takayuki Nakatsuka, Kazuyoshi Yoshii, Yuki Koyama, Satoru Fukayama, Masataka Goto, Shigeo Morishima

Abstract: This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we propose a semi-supervised method that can make effect… ▽ More This paper proposes a statistical approach to 2D pose estimation from human images. The main problems with the standard supervised approach, which is based on a deep recognition (image-to-pose) model, are that it often yields anatomically implausible poses, and its performance is limited by the amount of paired data. To solve these problems, we propose a semi-supervised method that can make effective use of images with and without pose annotations. Specifically, we formulate a hierarchical generative model of poses and images by integrating a deep generative model of poses from pose features with that of images from poses and image features. We then introduce a deep recognition model that infers poses from images. Given images as observed data, these models can be trained jointly in a hierarchical variational autoencoding (image-to-pose-to-feature-to-pose-to-image) manner. The results of experiments show that the proposed reflective architecture makes estimated poses anatomically plausible, and the performance of pose estimation improved by integrating the recognition and generative models and also by feeding non-annotated images. △ Less

Submitted 8 April, 2020; originally announced April 2020.

Comments: 19 pages

arXiv:1905.07666 [pdf, other]

What Do Adversarially Robust Models Look At?

Authors: Takahiro Itazuri, Yoshihiro Fukuhara, Hirokatsu Kataoka, Shigeo Morishima

Abstract: In this paper, we address the open question: "What do adversarially robust models look at?" Recently, it has been reported in many works that there exists the trade-off between standard accuracy and adversarial robustness. According to prior works, this trade-off is rooted in the fact that adversarially robust and standard accurate models might depend on very different sets of features. However, i… ▽ More In this paper, we address the open question: "What do adversarially robust models look at?" Recently, it has been reported in many works that there exists the trade-off between standard accuracy and adversarial robustness. According to prior works, this trade-off is rooted in the fact that adversarially robust and standard accurate models might depend on very different sets of features. However, it has not been well studied what kind of difference actually exists. In this paper, we analyze this difference through various experiments visually and quantitatively. Experimental results show that adversarially robust models look at things at a larger scale than standard models and pay less attention to fine textures. Furthermore, although it has been claimed that adversarially robust features are not compatible with standard accuracy, there is even a positive effect by using them as pre-trained models particularly in low resolution datasets. △ Less

Submitted 18 May, 2019; originally announced May 2019.

arXiv:1905.05172 [pdf, other]

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

Authors: Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, Hao Li

Abstract: We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images.… ▽ More We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu can produce high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image. △ Less

Submitted 3 December, 2019; v1 submitted 13 May, 2019; originally announced May 2019.

Comments: project page: https://shunsukesaito.github.io/PIFu

Journal ref: The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 2304-2314

arXiv:1901.00049 [pdf, other]

SiCloPe: Silhouette-Based Clothed People

Authors: Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, Shigeo Morishima

Abstract: We introduce a new silhouette-based representation for modeling clothed human bodies using deep generative models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D silhouettes and 3D joints of a body pose to describe the immense shape complexity and variati… ▽ More We introduce a new silhouette-based representation for modeling clothed human bodies using deep generative models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D silhouettes and 3D joints of a body pose to describe the immense shape complexity and variations of clothed people. Given a segmented 2D silhouette of a person and its inferred 3D joints from the input picture, we first synthesize consistent silhouettes from novel view points around the subject. The synthesized silhouettes which are the most consistent with the input segmentation are fed into a deep visual hull algorithm for robust 3D shape prediction. We then infer the texture of the subject's back view using the frontal image and segmentation mask as input to a conditional generative adversarial network. Our experiments demonstrate that our silhouette-based model is an effective representation and the appearance of the back view can be predicted reliably using an image-to-image translation network. While classic methods based on parametric models often fail for single-view images of subjects with challenging clothing, our approach can still produce successful results, which are comparable to those obtained from multi-view input. △ Less

Submitted 10 April, 2019; v1 submitted 31 December, 2018; originally announced January 2019.

arXiv:1811.12666 [pdf, other]

doi 10.1007/978-3-030-20876-9_8

FSNet: An Identity-Aware Generative Model for Image-based Face Swap**

Authors: Ryota Natsume, Tatsuya Yatagawa, Shigeo Morishima

Abstract: This paper presents FSNet, a deep generative model for image-based face swap**. Traditionally, face-swap** methods are based on three-dimensional morphable models (3DMMs), and facial textures are replaced between the estimated three-dimensional (3D) geometries in two images of different individuals. However, the estimation of 3D geometries along with different lighting conditions using 3DMMs i… ▽ More This paper presents FSNet, a deep generative model for image-based face swap**. Traditionally, face-swap** methods are based on three-dimensional morphable models (3DMMs), and facial textures are replaced between the estimated three-dimensional (3D) geometries in two images of different individuals. However, the estimation of 3D geometries along with different lighting conditions using 3DMMs is still a difficult task. We herein represent the face region with a latent variable that is assigned with the proposed deep neural network (DNN) instead of facial textures. The proposed DNN synthesizes a face-swapped image using the latent variable of the face region and another image of the non-face region. The proposed method is not required to fit to the 3DMM; additionally, it performs face swap** only by feeding two face images to the proposed network. Consequently, our DNN-based face swap** performs better than previous approaches for challenging inputs with different face orientations and lighting conditions. Through several experiments, we demonstrated that the proposed method performs face swap** in a more stable manner than the state-of-the-art method, and that its results are compatible with the method thereof. △ Less

Submitted 30 November, 2018; originally announced November 2018.

Comments: 20pages, Asian Conference of Computer Vision 2018

arXiv:1811.06943 [pdf, ps, other]

Automatic Paper Summary Generation from Visual and Textual Information

Authors: Shintaro Yamamoto, Yoshihiro Fukuhara, Ryota Suzuki, Shigeo Morishima, Hirokatsu Kataoka

Abstract: Due to the recent boom in artificial intelligence (AI) research, including computer vision (CV), it has become impossible for researchers in these fields to keep up with the exponentially increasing number of manuscripts. In response to this situation, this paper proposes the paper summary generation (PSG) task using a simple but effective method to automatically generate an academic paper summary… ▽ More Due to the recent boom in artificial intelligence (AI) research, including computer vision (CV), it has become impossible for researchers in these fields to keep up with the exponentially increasing number of manuscripts. In response to this situation, this paper proposes the paper summary generation (PSG) task using a simple but effective method to automatically generate an academic paper summary from raw PDF data. We realized PSG by combination of vision-based supervised components detector and language-based unsupervised important sentence extractor, which is applicable for a trained format of manuscripts. We show the quantitative evaluation of ability of simple vision-based components extraction, and the qualitative evaluation that our system can extract both visual item and sentence that are helpful for understanding. After processing via our PSG, the 979 manuscripts accepted by the Conference on Computer Vision and Pattern Recognition (CVPR) 2018 are available. It is believed that the proposed method will provide a better way for researchers to stay caught with important academic papers. △ Less

Submitted 16 November, 2018; originally announced November 2018.

Comments: International Conference on Machine Vision 2018, Munich, Germany

arXiv:1809.08391 [pdf, other]

Understanding Fake Faces

Authors: Ryota Natsume, Kazuki Inoue, Yoshihiro Fukuhara, Shintaro Yamamoto, Shigeo Morishima, Hirokatsu Kataoka

Abstract: Face recognition research is one of the most active topics in computer vision (CV), and deep neural networks (DNN) are now filling the gap between human-level and computer-driven performance levels in face verification algorithms. However, although the performance gap appears to be narrowing in terms of accuracy-based expectations, a curious question has arisen; specifically, "Face understanding o… ▽ More Face recognition research is one of the most active topics in computer vision (CV), and deep neural networks (DNN) are now filling the gap between human-level and computer-driven performance levels in face verification algorithms. However, although the performance gap appears to be narrowing in terms of accuracy-based expectations, a curious question has arisen; specifically, "Face understanding of AI is really close to that of human?" In the present study, in an effort to confirm the brain-driven concept, we conduct image-based detection, classification, and generation using an in-house created fake face database. This database has two configurations: (i) false positive face detections produced using both the Viola Jones (VJ) method and convolutional neural networks (CNN), and (ii) simulacra that have fundamental characteristics that resemble faces but are completely artificial. The results show a level of suggestive knowledge that indicates the continuing existence of a gap between the capabilities of recent vision-based face recognition algorithms and human-level performance. On a positive note, however, we have obtained knowledge that will advance the progress of face-understanding models. △ Less

Submitted 22 September, 2018; originally announced September 2018.

Comments: 11 pages, 3 figures, ECCV 2018 Workshop on Brain-Driven Computer Vision (BDCV)

arXiv:1804.03447 [pdf, other]

doi 10.1145/3230744.3230818

RSGAN: Face Swap** and Editing using Face and Hair Representation in Latent Spaces

Authors: Ryota Natsume, Tatsuya Yatagawa, Shigeo Morishima

Abstract: In this paper, we present an integrated system for automatically generating and editing face images through face swap**, attribute-based editing, and random face parts synthesis. The proposed system is based on a deep neural network that variationally learns the face and hair regions with large-scale face image datasets. Different from conventional variational methods, the proposed network repre… ▽ More In this paper, we present an integrated system for automatically generating and editing face images through face swap**, attribute-based editing, and random face parts synthesis. The proposed system is based on a deep neural network that variationally learns the face and hair regions with large-scale face image datasets. Different from conventional variational methods, the proposed network represents the latent spaces individually for faces and hairs. We refer to the proposed network as region-separative generative adversarial network (RSGAN). The proposed network independently handles face and hair appearances in the latent spaces, and then, face swap** is achieved by replacing the latent-space representations of the faces, and reconstruct the entire face image with them. This approach in the latent space robustly performs face swap** even for images which the previous methods result in failure due to inappropriate fitting or the 3D morphable models. In addition, the proposed system can further edit face-swapped images with the same network by manipulating visual attributes or by composing them with randomly generated face or hair parts. △ Less

Submitted 18 April, 2018; v1 submitted 10 April, 2018; originally announced April 2018.

Showing 1–26 of 26 results for author: Morishima, S