Search | arXiv e-print repository

Designing Library of Skill-Agents for Hardware-Level Reusability

Authors: Jun Takamatsu, Daichi Saito, Katsushi Ikeuchi, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake

Abstract: To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by consider… ▽ More To use new robot hardware in a new environment, it is necessary to develop a control program tailored to that specific robot in that environment. Considering the reusability of software among robots is crucial to minimize the effort involved in this process and maximize software reuse across different robots in different environments. This paper proposes a method to remedy this process by considering hardware-level reusability, using Learning-from-observation (LfO) paradigm with a pre-designed skill-agent library. The LfO framework represents the required actions in hardware-independent representations, referred to as task models, from observing human demonstrations, capturing the necessary parameters for the interaction between the environment and the robot. When executing the desired actions from the task models, a set of skill agents is employed to convert the representations into robot commands. This paper focuses on the latter part of the LfO framework, utilizing the set to generate robot actions from the task models, and explores a hardware-independent design approach for these skill agents. These skill agents are described in a hardware-independent manner, considering the relative relationship between the robot's hand position and the environment. As a result, it is possible to execute these actions on robots with different hardware configurations by simply swap** the inverse kinematics solver. This paper, first, defines a necessary and sufficient skill-agent set corresponding to cover all possible actions, and considers the design principles for these skill agents in the library. We provide concrete examples of such skill agents and demonstrate the practicality of using these skill agents by showing that the same representations can be executed on two different robots, Nextage and Fetch, using the proposed skill-agents set. △ Less

Submitted 20 March, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2311.12015 [pdf, other]

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and… ▽ More We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Object are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of gras** and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in achieving real robots' operations from human demonstrations in a one-shot manner. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: △ Less

Submitted 6 May, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: 9 pages, 12 figures, 2 tables. Last updated on May 6th, 2024

arXiv:2311.11007 [pdf, other]

Constraint-aware Policy for Compliant Manipulation

Authors: Daichi Saito, Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Hideki Koike, Katsushi Ikeuchi

Abstract: Robot manipulation in a physically-constrained environment requires compliant manipulation. Compliant manipulation is a manipulation skill to adjust hand motion based on the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to solve household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy… ▽ More Robot manipulation in a physically-constrained environment requires compliant manipulation. Compliant manipulation is a manipulation skill to adjust hand motion based on the force imposed by the environment. Recently, reinforcement learning (RL) has been applied to solve household operations involving compliant manipulation. However, previous RL methods have primarily focused on designing a policy for a specific operation that limits their applicability and requires separate training for every new operation. We propose a constraint-aware policy that is applicable to various unseen manipulations by grou** several manipulations together based on the type of physical constraint involved. The type of physical constraint determines the characteristic of the imposed force direction; thus, a generalized policy is trained in the environment and reward designed on the basis of this characteristic. This paper focuses on two types of physical constraints: prismatic and revolute joints. Experiments demonstrated that the same policy could successfully execute various compliant-manipulation operations, both in the simulation and reality. We believe this study is the first step toward realizing a generalized household-robot. △ Less

Submitted 18 November, 2023; originally announced November 2023.

arXiv:2310.11753 [pdf, other]

Bias in Emotion Recognition with ChatGPT

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluat… ▽ More This technical report explores the ability of ChatGPT in recognizing emotions from text, which can be the basis of various applications like interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition is not yet explored. Here, we conducted experiments to evaluate its performance of emotion recognition across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT. △ Less

Submitted 4 December, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: 5 pages, 4 figures, 6 tables

arXiv:2306.01741 [pdf, other]

GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic a… ▽ More This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic applications, which benefits the development of both chatbots and LLMs. Specifically, it enables the development of highly responsive chatbot systems by leveraging LLMs and adds visual effects to the user interface of LLMs as an additional value. The source code for the system is available on GitHub for our in-house robot (https://github.com/microsoft/LabanotationSuite/tree/master/MSRAbotChatSimulation) and GitHub for Toyota HSR (https://github.com/microsoft/GPT-Enabled-HSR-CoSpeechGestures). △ Less

Submitted 10 May, 2023; originally announced June 2023.

arXiv:2304.03893 [pdf, other]

doi 10.1109/ACCESS.2023.3310935

ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. The paper proposes easy-to-customize input prompts for ChatGPT that meet common requirements in practical applications, such as easy integration with robot execution systems and applicability to various environments while minimizing th… ▽ More This paper demonstrates how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. The paper proposes easy-to-customize input prompts for ChatGPT that meet common requirements in practical applications, such as easy integration with robot execution systems and applicability to various environments while minimizing the impact of ChatGPT's token limit. The prompts encourage ChatGPT to output a sequence of predefined robot actions, represent the operating environment in a formalized style, and infer the updated state of the operating environment. Experiments confirmed that the proposed prompts enable ChatGPT to act according to requirements in various environments, and users can adjust ChatGPT's output with natural language feedback for safe and robust operation. The proposed prompts and source code are open-source and publicly available at https://github.com/microsoft/ChatGPT-Robot-Manipulation-Prompts △ Less

Submitted 29 August, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: 21 figures, 7 tables. Published in IEEE Access (in press). Last updated August 29th, 2023

arXiv:2301.01382 [pdf, other]

Task-sequencing Simulator: Integrated Machine Learning to Execution Simulation for Robot Manipulation

Authors: Kazuhiro Sasabuchi, Daichi Saito, Atsushi Kanehira, Naoki Wake, Jun Takamatsu, Katsushi Ikeuchi

Abstract: A task-sequencing simulator in robotics manipulation to integrate simulation-for-learning and simulation-for-execution is introduced. Unlike existing machine-learning simulation where a non-decomposed simulation is used to simulate a training scenario, the task-sequencing simulator runs a composed simulation using building blocks. This way, the simulation-for-learning is structured similarly to a… ▽ More A task-sequencing simulator in robotics manipulation to integrate simulation-for-learning and simulation-for-execution is introduced. Unlike existing machine-learning simulation where a non-decomposed simulation is used to simulate a training scenario, the task-sequencing simulator runs a composed simulation using building blocks. This way, the simulation-for-learning is structured similarly to a multi-step simulation-for-execution. To compose both learning and execution scenarios, a unified trainable-and-composable description of blocks called a concept model is proposed and used. Using the simulator design and concept models, a reusable simulator for learning different tasks, a common-ground system for learning-to-execution, simulation-to-real is achieved and shown. △ Less

Submitted 3 January, 2023; originally announced January 2023.

Comments: 7 pages, 6 figures

arXiv:2212.10787 [pdf, other]

doi 10.1109/AIM46323.2023.10196126

Interactive Task Encoding System for Learning-from-Observation

Authors: Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi

Abstract: We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages… ▽ More We present the Interactive Task Encoding System (ITES) for teaching robots to perform manipulative tasks. ITES is designed as an input system for the Learning-from-Observation (LfO) framework, which enables household robots to be programmed using few-shot human demonstrations without the need for coding. In contrast to previous LfO systems that rely solely on visual demonstrations, ITES leverages both verbal instructions and interaction to enhance recognition robustness, thus enabling multimodal LfO. ITES identifies tasks from verbal instructions and extracts parameters from visual demonstrations. Meanwhile, the recognition result was reviewed by the user for interactive correction. Evaluations conducted on a real robot demonstrate the successful teaching of multiple operations for several scenarios, suggesting the usefulness of ITES for multimodal LfO. The source code is available at https://github.com/microsoft/symbolic-robot-teaching-interface. △ Less

Submitted 28 April, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

Comments: 6 pages, 9 figures. Submitted to and accepted by 2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). Last updated April 28st, 2023

arXiv:2212.09242 [pdf, other]

Learning-from-Observation System Considering Hardware-Level Reusability

Authors: Jun Takamatsu, Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Katsushi Ikeuchi

Abstract: Robot developers develop various types of robots for satisfying users' various demands. Users' demands are related to their backgrounds and robots suitable for users may vary. If a certain developer would offer a robot that is different from the usual to a user, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their developed software… ▽ More Robot developers develop various types of robots for satisfying users' various demands. Users' demands are related to their backgrounds and robots suitable for users may vary. If a certain developer would offer a robot that is different from the usual to a user, the robot-specific software has to be changed. On the other hand, robot-software developers would like to reuse their developed software as much as possible to reduce their efforts. We propose the system design considering hardware-level reusability. For this purpose, we begin with the learning-from-observation framework. This framework represents a target task in robot-agnostic representation, and thus the represented task description can be shared with various robots. When executing the task, it is necessary to convert the robot-agnostic description into commands of a target robot. To increase the reusability, first, we implement the skill library, robot motion primitives, only considering a robot hand and we regarded that a robot was just a carrier to move the hand on the target trajectory. The skill library is reusable if we would like to the same robot hand. Second, we employ the generic IK solver to quickly swap a robot. We verify the hardware-level reusability by applying two task descriptions to two different robots, Nextage and Fetch. △ Less

Submitted 18 December, 2022; originally announced December 2022.

Comments: 5 pages, 4 figures

arXiv:2106.04555 [pdf, other]

Hierarchical Lovász Embeddings for Proposal-free Panoptic Segmentation

Authors: Tommi Kerola, Jie Li, Atsushi Kanehira, Yasunori Kudo, Alexis Vallet, Adrien Gaidon

Abstract: Panoptic segmentation brings together two separate tasks: instance and semantic segmentation. Although they are related, unifying them faces an apparent paradox: how to learn simultaneously instance-specific and category-specific (i.e. instance-agnostic) representations jointly. Hence, state-of-the-art panoptic segmentation methods use complex models with a distinct stream for each task. In contra… ▽ More Panoptic segmentation brings together two separate tasks: instance and semantic segmentation. Although they are related, unifying them faces an apparent paradox: how to learn simultaneously instance-specific and category-specific (i.e. instance-agnostic) representations jointly. Hence, state-of-the-art panoptic segmentation methods use complex models with a distinct stream for each task. In contrast, we propose Hierarchical Lovász Embeddings, per pixel feature vectors that simultaneously encode instance- and category-level discriminative information. We use a hierarchical Lovász hinge loss to learn a low-dimensional embedding space structured into a unified semantic and instance hierarchy without requiring separate network branches or object proposals. Besides modeling instances precisely in a proposal-free manner, our Hierarchical Lovász Embeddings generalize to categories by using a simple Nearest-Class-Mean classifier, including for non-instance "stuff" classes where instance segmentation methods are not applicable. Our simple model achieves state-of-the-art results compared to existing proposal-free panoptic segmentation methods on Cityscapes, COCO, and Mapillary Vistas. Furthermore, our model demonstrates temporal stability between video frames. △ Less

Submitted 8 June, 2021; originally announced June 2021.

Comments: 13 pages, 9 figures, including supplementary material. To be published in CVPR 2021

arXiv:1812.01280 [pdf, other]

Learning to Explain with Complemental Examples

Authors: Atsushi Kanehira, Tatsuya Harada

Abstract: This paper addresses the generation of explanations with visual examples. Given an input sample, we build a system that not only classifies it to a specific category, but also outputs linguistic explanations and a set of visual examples that render the decision interpretable. Focusing especially on the complementarity of the multimodal information, i.e., linguistic and visual examples, we attempt… ▽ More This paper addresses the generation of explanations with visual examples. Given an input sample, we build a system that not only classifies it to a specific category, but also outputs linguistic explanations and a set of visual examples that render the decision interpretable. Focusing especially on the complementarity of the multimodal information, i.e., linguistic and visual examples, we attempt to achieve it by maximizing the interaction information, which provides a natural definition of complementarity from an information theoretical viewpoint. We propose a novel framework to generate complemental explanations, on which the joint distribution of the variables to explain, and those to be explained is parameterized by three different neural networks: predictor, linguistic explainer, and example selector. Explanation models are trained collaboratively to maximize the interaction information to ensure the generated explanation are complemental to each other for the target. The results of experiments conducted on several datasets demonstrate the effectiveness of the proposed method. △ Less

Submitted 20 May, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: Camera ready version of CVPR'19

arXiv:1812.01263 [pdf, other]

Multimodal Explanations by Predicting Counterfactuality in Videos

Authors: Atsushi Kanehira, Kentaro Takemoto, Sho Inayoshi, Tatsuya Harada

Abstract: This study addresses generating counterfactual explanations with multimodal information. Our goal is not only to classify a video into a specific category, but also to provide explanations on why it is not categorized to a specific class with combinations of visual-linguistic information. Requirements that the expected output should satisfy are referred to as counterfactuality in this paper: (1) C… ▽ More This study addresses generating counterfactual explanations with multimodal information. Our goal is not only to classify a video into a specific category, but also to provide explanations on why it is not categorized to a specific class with combinations of visual-linguistic information. Requirements that the expected output should satisfy are referred to as counterfactuality in this paper: (1) Compatibility of visual-linguistic explanations, and (2) Positiveness/negativeness for the specific positive/negative class. Exploiting a spatio-temporal region (tube) and an attribute as visual and linguistic explanations respectively, the explanation model is trained to predict the counterfactuality for possible combinations of multimodal information in a post-hoc manner. The optimization problem, which appears during training/inference, can be efficiently solved by inserting a novel neural network layer, namely the maximum subpath layer. We demonstrated the effectiveness of this method by comparison with a baseline of the action recognition datasets extended for this task. Moreover, we provide information-theoretical insight into the proposed method. △ Less

Submitted 19 May, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: Camera ready version of CVPR'19

arXiv:1804.02843 [pdf, other]

Viewpoint-aware Video Summarization

Authors: Atsushi Kanehira, Luc Van Gool, Yoshitaka Ushiku, Tatsuya Harada

Abstract: This paper introduces a novel variant of video summarization, namely building a summary that depends on the particular aspect of a video the viewer focuses on. We refer to this as $\textit{viewpoint}$. To infer what the desired $\textit{viewpoint}$ may be, we assume that several other videos are available, especially groups of videos, e.g., as folders on a person's phone or laptop. The semantic si… ▽ More This paper introduces a novel variant of video summarization, namely building a summary that depends on the particular aspect of a video the viewer focuses on. We refer to this as $\textit{viewpoint}$. To infer what the desired $\textit{viewpoint}$ may be, we assume that several other videos are available, especially groups of videos, e.g., as folders on a person's phone or laptop. The semantic similarity between videos in a group vs. the dissimilarity between groups is used to produce $\textit{viewpoint}$-specific summaries. For considering similarity as well as avoiding redundancy, output summary should be (A) diverse, (B) representative of videos in the same group, and (C) discriminative against videos in the different groups. To satisfy these requirements (A)-(C) simultaneously, we proposed a novel video summarization method from multiple groups of videos. Inspired by Fisher's discriminant criteria, it selects summary by optimizing the combination of three terms (a) inner-summary, (b) inner-group, and (c) between-group variances defined on the feature representation of summary, which can simply represent (A)-(C). Moreover, we developed a novel dataset to investigate how well the generated summary reflects the underlying $\textit{viewpoint}$. Quantitative and qualitative experiments conducted on the dataset demonstrate the effectiveness of proposed method. △ Less

Submitted 10 April, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: to appear at CVPR 2018

arXiv:1511.06783 [pdf, ps, other]

Recognizing Activities of Daily Living with a Wrist-mounted Camera

Authors: Katsunori Ohnishi, Atsushi Kanehira, Asako Kanezaki, Tatsuya Harada

Abstract: We present a novel dataset and a novel algorithm for recognizing activities of daily living (ADL) from a first-person wearable camera. Handled objects are crucially important for egocentric ADL recognition. For specific examination of objects related to users' actions separately from other objects in an environment, many previous works have addressed the detection of handled objects in images capt… ▽ More We present a novel dataset and a novel algorithm for recognizing activities of daily living (ADL) from a first-person wearable camera. Handled objects are crucially important for egocentric ADL recognition. For specific examination of objects related to users' actions separately from other objects in an environment, many previous works have addressed the detection of handled objects in images captured from head-mounted and chest-mounted cameras. Nevertheless, detecting handled objects is not always easy because they tend to appear small in images. They can be occluded by a user's body. As described herein, we mount a camera on a user's wrist. A wrist-mounted camera can capture handled objects at a large scale, and thus it enables us to skip object detection process. To compare a wrist-mounted camera and a head-mounted camera, we also develop a novel and publicly available dataset that includes videos and annotations of daily activities captured simultaneously by both cameras. Additionally, we propose a discriminative video representation that retains spatial and temporal information after encoding frame descriptors extracted by Convolutional Neural Networks (CNN). △ Less

Submitted 28 April, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

Comments: CVPR2016 spotlight presentation

arXiv:1502.06064 [pdf, ps, other]

MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning

Authors: Ken Miura, Tetsuaki Mano, Atsushi Kanehira, Yuichiro Tsuchiya, Tatsuya Harada

Abstract: MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering a matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning libraries written in JavaScript. Especially, our matrix multiplication is 177 times faster than the fastest Java… ▽ More MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering a matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning libraries written in JavaScript. Especially, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. Based on Sushi, a machine learning library called Tempura is provided, which supports various algorithms widely used in machine learning research. We also provide Soba as a visualization library. The implementations of our libraries are clearly written, properly documented and thus can are easy to get started with, as long as there is a web browser. These libraries are available from http://mil-tokyo.github.io/ under the MIT license. △ Less

Submitted 20 February, 2015; originally announced February 2015.

Showing 1–15 of 15 results for author: Kanehira, A