-
Map-based Modular Approach for Zero-shot Embodied Question Answering
Authors:
Koya Sakamoto,
Daichi Azuma,
Taiki Miyanishi,
Shuhei Kurita,
Motoaki Kawanabe
Abstract:
Building robots capable of interacting with humans through natural language in the visual world presents a significant challenge in the field of robotics. To overcome this challenge, Embodied Question Answering (EQA) has been proposed as a benchmark task to measure the ability to identify an object navigating through a previously unseen environment in response to human-posed questions. Although some methods have been proposed, their evaluations have been limited to simulations, without experiments in real-world scenarios. Furthermore, all of these methods are constrained by a limited vocabulary for question-and-answer interactions, making them unsuitable for practical applications. In this work, we propose a map-based modular EQA method that enables real robots to navigate unknown environments through frontier-based map creation and address unknown QA pairs using foundation models that support open vocabulary. Unlike the questions of the previous EQA dataset on Matterport 3D (MP3D), questions in our real-world experiments contain various question formats and vocabularies not included in the training data. We conduct comprehensive experiments on virtual environments (MP3D-EQA) and two real-world house environments and demonstrate that our method can perform EQA even in the real world.
Submitted 26 May, 2024;
originally announced May 2024.
-
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data
Authors:
Taiki Miyanishi,
Fumiya Kitamori,
Shuhei Kurita,
Jungdae Lee,
Motoaki Kawanabe,
Nakamasa Inoue
Abstract:
City-scale 3D point cloud is a promising way to express detailed and complicated outdoor structures. It encompasses both the appearance and geometry features of segmented city components, including cars, streets, and buildings, that can be utilized for attractive applications such as user-interactive navigation of autonomous vehicles and drones. However, compared to the extensive text annotations available for images and indoor scenes, the scarcity of text annotations for outdoor scenes poses a significant challenge for achieving these applications. To tackle this problem, we introduce the CityRefer dataset for city-level visual grounding. The dataset consists of 35k natural language descriptions of 3D objects appearing in SensatUrban city scenes and 5k landmarks labels synchronizing with OpenStreetMap. To ensure the quality and accuracy of the dataset, all descriptions and labels in the CityRefer dataset are manually verified. We also have developed a baseline system that can learn encoded language descriptions, 3D object instances, and geographical information about the city's landmarks to perform visual grounding on the CityRefer dataset. To the best of our knowledge, the CityRefer dataset is the largest city-level visual grounding dataset for localizing specific 3D objects.
Submitted 28 October, 2023;
originally announced October 2023.
-
Cross3DVG: Cross-Dataset 3D Visual Grounding on Different RGB-D Scans
Authors:
Taiki Miyanishi,
Daichi Azuma,
Shuhei Kurita,
Motoaki Kawanabe
Abstract:
We present a novel task for cross-dataset visual grounding in 3D scenes (Cross3DVG), which overcomes limitations of existing 3D visual grounding models, specifically their restricted 3D resources and consequent tendencies of overfitting a specific 3D dataset. We created RIORefer, a large-scale 3D visual grounding dataset, to facilitate Cross3DVG. It includes more than 63k diverse descriptions of 3D objects within 1,380 indoor RGB-D scans from 3RScan, with human annotations. After training the Cross3DVG model using the source 3D visual grounding dataset, we evaluate it without target labels using the target dataset with, e.g., different sensors, 3D reconstruction methods, and language annotators. Comprehensive experiments are conducted using established visual grounding models and with CLIP-based multi-view 2D and 3D integration designed to bridge gaps among 3D datasets. For Cross3DVG tasks, (i) cross-dataset 3D visual grounding exhibits significantly worse performance than learning and evaluation with a single dataset because of the 3D data and language variants across datasets. Moreover, (ii) better object detector and localization modules and fusing 3D data and multi-view CLIP-based image features can alleviate this lower performance. Our Cross3DVG task can provide a benchmark for developing robust 3D visual grounding models to handle diverse 3D scenes while leveraging deep language understanding.
Submitted 7 February, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG
Authors:
Reinmar J Kobler,
Jun-ichiro Hirayama,
Qibin Zhao,
Motoaki Kawanabe
Abstract:
Electroencephalography (EEG) provides access to neuronal dynamics non-invasively with millisecond resolution, rendering it a viable method in neuroscience and healthcare. However, its utility is limited as current EEG technology does not generalize well across domains (i.e., sessions and subjects) without expensive supervised re-calibration. Contemporary methods cast this transfer learning (TL) problem as a multi-source/-target unsupervised domain adaptation (UDA) problem and address it with deep learning or shallow, Riemannian geometry aware alignment methods. Both directions have, so far, failed to consistently close the performance gap to state-of-the-art domain-specific methods based on tangent space mapping (TSM) on the symmetric positive definite (SPD) manifold. Here, we propose a theory-based machine learning framework that enables, for the first time, learning domain-invariant TSM models in an end-to-end fashion. To achieve this, we propose a new building block for geometric deep learning, which we denote SPD domain-specific momentum batch normalization (SPDDSMBN). A SPDDSMBN layer can transform domain-specific SPD inputs into domain-invariant SPD outputs, and can be readily applied to multi-source/-target and online UDA scenarios. In extensive experiments with 6 diverse EEG brain-computer interface (BCI) datasets, we obtain state-of-the-art performance in inter-session and -subject TL with a simple, intrinsically interpretable network architecture, which we denote TSMNet.
Submitted 12 October, 2022; v1 submitted 2 June, 2022;
originally announced June 2022.
-
ScanQA: 3D Question Answering for Spatial Scene Understanding
Authors:
Daichi Azuma,
Taiki Miyanishi,
Shuhei Kurita,
Motoaki Kawanabe
Abstract:
We propose a new 3D spatial understanding task of 3D Question Answering (3D-QA). In the 3D-QA task, models receive visual information from the entire 3D scene of the rich RGB-D indoor scan and answer the given textual questions about the 3D scene. Unlike the 2D-question answering of VQA, the conventional 2D-QA models suffer from problems with spatial understanding of object alignment and directions and fail the object identification from the textual questions in 3D-QA. We propose a baseline model for 3D-QA, named ScanQA model, where the model learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of 3D bounding boxes to determine described objects in textual questions and outputs correct answers. We collected human-edited question-answer pairs with free-form answers that are grounded to 3D objects in each 3D scene. Our new ScanQA dataset contains over 40K question-answer pairs from the 800 indoor scenes drawn from the ScanNet dataset. To the best of our knowledge, the proposed 3D-QA task is the first large-scale effort to perform object-grounded question-answering in 3D environments.
Submitted 7 May, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
-
On the interpretation of linear Riemannian tangent space model parameters in M/EEG
Authors:
Reinmar J. Kobler,
Jun-Ichiro Hirayama,
Lea Hehenberger,
Catarina Lopes-Dias,
Gernot R. Müller-Putz,
Motoaki Kawanabe
Abstract:
Riemannian tangent space methods offer state-of-the-art performance in magnetoencephalography (MEG) and electroencephalography (EEG) based applications such as brain-computer interfaces and biomarker development. One limitation, particularly relevant for biomarker development, is limited model interpretability compared to established component-based methods. Here, we propose a method to transform the parameters of linear tangent space models into interpretable patterns. Using typical assumptions, we show that this approach identifies the true patterns of latent sources, encoding a target signal. In simulations and two real MEG and EEG datasets, we demonstrate the validity of the proposed approach and investigate its behavior when the model assumptions are violated. Our results confirm that Riemannian tangent space methods are robust to differences in the source patterns across observations. We found that this robustness property also transfers to the associated patterns.
Submitted 29 July, 2021;
originally announced July 2021.
-
Insights from Classifying Visual Concepts with Multiple Kernel Learning
Authors:
Alexander Binder,
Shinichi Nakajima,
Marius Kloft,
Christina Müller,
Wojciech Samek,
Ulf Brefeld,
Klaus-Robert Müller,
Motoaki Kawanabe
Abstract:
Combining information from various image features has become a standard technique in concept recognition tasks. However, the optimal way of fusing the resulting kernel functions is usually unknown in practical applications. Multiple kernel learning (MKL) techniques allow one to determine an optimal linear combination of such similarity matrices. Classical approaches to MKL promote sparse mixtures. Unfortunately, so-called 1-norm MKL variants are often observed to be outperformed by an unweighted sum kernel. The contribution of this paper is twofold: We apply a recently developed non-sparse MKL variant to state-of-the-art concept recognition tasks within computer vision. We provide insights on benefits and limits of non-sparse MKL and compare it against its direct competitors, the sum kernel SVM and the sparse MKL. We report empirical results for the PASCAL VOC 2009 Classification and ImageCLEF2010 Photo Annotation challenge data sets. About to be submitted to PLoS ONE.
Submitted 15 December, 2011;
originally announced December 2011.
-
How to Explain Individual Classification Decisions
Authors:
David Baehrens,
Timon Schroeter,
Stefan Harmeling,
Motoaki Kawanabe,
Katja Hansen,
Klaus-Robert Mueller
Abstract:
After building a classifier with modern tools of machine learning, we typically have a black box at hand that is able to predict well for unseen data. Thus, we get an answer to the question of what the most likely label of a given unseen data point is. However, most methods will provide no answer as to why the model predicted the particular label for a single instance and what features were most influential for that particular instance. The only methods currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows one to explain the decisions of any classification method.
Submitted 6 December, 2009;
originally announced December 2009.