Search | arXiv e-print repository

Closed Loop Interactive Embodied Reasoning for Robot Manipulation

Authors: Michal Nazarczuk, Jan Kristof Behrens, Karla Stepanova, Matej Hoffmann, Krystian Mikolajczyk

Abstract: Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting and changing the scene (e.g. 'Sort the objects from lightest to heaviest'). In order to facilitate the development of such systems we introduce a new simulating environment that makes use of MuJoCo physics engine and high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances as well as uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in the real world manipulation tasks with a success rate above 76% and 64%, respectively.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.01932

Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

Authors: Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

Abstract: In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced outputs. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs) which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by the individual tasks such as object or robot position variability, number of distractors or the task length. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.

Submitted 2 April, 2024; originally announced April 2024.

Comments: 7 pages, 5 figures, 2 tables, conference

arXiv:2404.01702

Tell and show: Combining multiple modalities to communicate manipulation tasks to a robot

Authors: Petr Vanc, Radoslav Skoviera, Karla Stepanova

Abstract: As human-robot collaboration is becoming more widespread, there is a need for a more natural way of communicating with the robot. This includes combining data from several modalities together with the context of the situation and background knowledge. Current approaches to communication typically rely only on a single modality or are often very rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method that takes inspiration from sensor fusion approaches to combine uncertain information from multiple modalities and enhance it with situational awareness (e.g., considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show by several ablation experiments the importance of various components of the system and its robustness to noisy, missing, or misaligned observations. Then we implement and evaluate the model on the real setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or if we should better query humans for clarification. For these purposes, we enhance our model with adaptive entropy-based thresholding that detects the appropriate thresholds for different types of interaction showing similar performance as fine-tuned fixed thresholds.

Submitted 2 April, 2024; originally announced April 2024.

Comments: 8 pages, 8 figures
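
The execute-or-ask decision mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single fixed threshold on normalized Shannon entropy, whereas the paper describes adaptive thresholds. The names `normalized_entropy`, `decide`, the action labels, and the default threshold of 0.5 are all assumptions made for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a discrete distribution, scaled to [0, 1]."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    if len(probs) < 2:
        return 0.0  # a single candidate carries no uncertainty
    # Normalize and skip zero entries (0 * log 0 is taken as 0).
    p = [x / total for x in probs if x > 0]
    h = -sum(x * math.log(x) for x in p)
    return h / math.log(len(probs))


def decide(action_probs, threshold=0.5):
    """Execute the most likely action if the distribution is peaked enough.

    `threshold` is a hypothetical fixed value; an adaptive scheme would
    choose it per interaction type instead.
    """
    h = normalized_entropy(list(action_probs.values()))
    if h <= threshold:
        best = max(action_probs, key=action_probs.get)
        return ("execute", best)
    return ("ask", None)  # too uncertain: query the human for clarification
```

With a peaked distribution such as `{"pick": 0.9, "pour": 0.05, "push": 0.05}` the sketch chooses to execute `pick`; with a uniform distribution over the same three actions it asks for clarification.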

arXiv:2312.06280

Adaptive Compression of the Latent Space in Variational Autoencoders

Authors: Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

Abstract: Variational Autoencoders (VAEs) are powerful generative models that have been widely used in various fields, including image and text generation. However, one of the known challenges in using VAEs is the model's sensitivity to its hyperparameters, such as the latent space size. This paper presents a simple extension of VAEs for automatically determining the optimal latent space size during the training process by gradually decreasing the latent size through neuron removal and observing the model performance. The proposed method is compared to traditional hyperparameter grid search and is shown to be significantly faster while still achieving the optimal dimensionality on four image datasets. Furthermore, we show that the final performance of our method is comparable to training on the optimal latent size from scratch, and might thus serve as a convenient substitute.

Submitted 11 December, 2023; originally announced December 2023.

Comments: 10 pages, 4 figures

arXiv:2303.04451

doi 10.1109/IROS55552.2023.10341944

Communicating human intent to a robotic companion by multi-type gesture sentences

Authors: Petr Vanc, Jan Kristof Behrens, Karla Stepanova, Vaclav Hlavac

Abstract: Human-Robot collaboration in home and industrial workspaces is on the rise. However, the communication between robots and humans is a bottleneck. Although people use a combination of different types of gestures to complement speech, only a few robotic systems utilize gestures for communication. In this paper, we propose a gesture pseudo-language and show how multiple types of gestures can be combined to express human intent to a robot (i.e., expressing both the desired action and its parameters - e.g., pointing to an object and showing that the object should be emptied into a bowl). The demonstrated gestures and the perceived table-top scene (object poses detected by CosyPose) are processed in real time to extract the human's intent. We utilize behavior trees to generate reactive robot behavior that handles various possible states of the world (e.g., a drawer has to be opened before an object is placed into it) and recovers from errors (e.g., when the scene changes). Furthermore, our system enables switching between direct teleoperation of the end-effector and high-level operation using the proposed gesture sentences. The system is evaluated on increasingly complex tasks using a real 7-DoF Franka Emika Panda manipulator. Controlling the robot via action gestures lowered the execution time by up to 60%, compared to direct teleoperation.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 7 pages, 9 figures

arXiv:2301.09899

doi 10.1109/ICRA48891.2023.10161308

Context-aware robot control using gesture episodes

Authors: Petr Vanc, Jan Kristof Behrens, Karla Stepanova

Abstract: Collaborative robots have become a popular tool for increasing productivity in partly automated manufacturing plants. Intuitive robot teaching methods are required to quickly and flexibly adapt the robot programs to new tasks. Gestures have an essential role in human communication. However, in human-robot-interaction scenarios, gesture-based user interfaces are so far used rarely, and when they are, they employ a one-to-one mapping of gestures to robot control variables. In this paper, we propose a method that infers the user's intent based on gesture episodes, the context of the situation, and common sense. The approach is evaluated in a simulated table-top manipulation setting. We conduct deterministic experiments with simulated users and show that the system can even handle personal preferences of each user.

Submitted 24 January, 2023; originally announced January 2023.

Comments: 7 pages, 8 figures, accepted for ICRA 2023

arXiv:2209.07976

doi 10.1109/LRA.2023.3259735

Imitrob: Imitation Learning Dataset for Training and Evaluating 6D Object Pose Estimators

Authors: Jiri Sedlar, Karla Stepanova, Radoslav Skoviera, Jan K. Behrens, Matus Tuna, Gabriela Sejnova, Josef Sivic, Robert Babuska

Abstract: This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera. Despite the significant progress of 6D pose estimation methods, their performance is usually limited for heavily occluded objects, which is a common case in imitation learning, where the object is typically partially occluded by the manipulating hand. Currently, there is a lack of datasets that would enable the development of robust 6D pose estimation methods for these conditions. To overcome this problem, we collect a new dataset (Imitrob) aimed at 6D pose estimation in imitation learning and other applications where a human holds a tool and performs a task. The dataset contains image sequences of nine different tools and twelve manipulation tasks with two camera viewpoints, four human subjects, and left/right hand. Each image is accompanied by an accurate ground truth measurement of the 6D object pose obtained by the HTC Vive motion tracking device. The use of the dataset is demonstrated by training and evaluating a recent 6D object pose estimation method (DOPE) in various setups.

Submitted 5 April, 2023; v1 submitted 16 September, 2022; originally announced September 2022.

Comments: The dataset and code are publicly available at http://imitrob.ciirc.cvut.cz/imitrobdataset.php

Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2788-2795, 2023

arXiv:2209.03048

Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit

Authors: Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

Abstract: Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in the past years as they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches toward multimodal VAE learning have been proposed so far; their comparison and evaluation have, however, been rather inconsistent. One reason is that the models differ at the implementation level; another problem is that the datasets commonly used in these cases were not initially designed to evaluate multimodal generative models. This paper addresses both mentioned issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets along with instructions on how to easily add a new model or a dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.

Submitted 24 November, 2023; v1 submitted 7 September, 2022; originally announced September 2022.

arXiv:2204.06343

Single-grasp deformable object discrimination: the effect of gripper morphology, sensing modalities, and action parameters

Authors: Michal Pliska, Shubhan Patni, Michal Mares, Pavel Stoudek, Zdenek Straka, Karla Stepanova, Matej Hoffmann

Abstract: In haptic object discrimination, the effect of gripper embodiment, action parameters, and sensory channels has not been systematically studied. We used two anthropomorphic hands and two 2-finger grippers to grasp two sets of deformable objects. On the object classification task, we found: (i) among classifiers, SVM on sensory features and LSTM on raw time series performed best across all grippers; (ii) faster compression speeds degraded performance; (iii) generalization to different grasping configurations was limited; transfer to different compression speeds worked well for the Barrett Hand only. Visualization of the feature spaces using PCA showed that the gripper morphology and the action parameters were the main source of variance, rendering generalization across embodiment or grasp configurations very hard. On the highly challenging dataset consisting of polyurethane foams alone, only the Barrett Hand achieved excellent performance. Tactile sensors can thus provide a key advantage even if recognition is based on stiffness rather than shape. The dataset with 24000 measurements is publicly available.

Submitted 2 February, 2024; v1 submitted 13 April, 2022; originally announced April 2022.

Comments: 12 pages, 9 figures

ACM Class: I.2.9

arXiv:2012.07548

doi 10.1016/j.rcim.2021.102250

Automatic self-contained calibration of an industrial dual-arm robot with cameras using self-contact, planar constraints, and self-observation

Authors: Karla Stepanova, Jakub Rozlivek, Frantisek Puciow, Pavel Krsek, Tomas Pajdla, Matej Hoffmann

Abstract: We present a robot kinematic calibration method that combines complementary calibration approaches: self-contact, planar constraints, and self-observation. We analyze the estimation of the end effector parameters, joint offsets of the manipulators, and calibration of the complete kinematic chain (DH parameters). The results are compared with ground truth measurements provided by a laser tracker. Our main findings are: (1) When applying the complementary calibration approaches in isolation, the self-contact approach yields the best and most stable results. (2) All combinations of more than one approach were always superior to using any single approach in terms of calibration errors and the observability of the estimated parameters. Combining more approaches delivers robot parameters that better generalize to the workspace parts not used for the calibration. (3) Sequential calibration, i.e. calibrating cameras first and then robot kinematics, is more effective than simultaneous calibration of all parameters. In real experiments, we employ two industrial manipulators mounted on a common base. The manipulators are equipped with force/torque sensors at their wrists, with two cameras attached to the robot base, and with special end effectors with fiducial markers. We collect a new comprehensive dataset for robot kinematic calibration and make it publicly available. The dataset and its analysis provide quantitative and qualitative insights that go beyond the specific manipulators used in this work and apply to self-contained robot kinematic calibration in general.

Submitted 24 September, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Comments: 25 pages, 29 figures

Journal ref: Robotics and Computer-Integrated Manufacturing 2022, Volume 73, 102250

arXiv:1901.08335

Teaching robots to imitate a human with no on-teacher sensors. What are the key challenges?

Authors: Radoslav Skoviera, Karla Stepanova, Michael Tesar, Gabriela Sejnova, Jiri Sedlar, Michal Vavrecka, Robert Babuska, Josef Sivic

Abstract: In this paper, we consider the problem of learning object manipulation tasks from human demonstration using RGB or RGB-D cameras. We highlight the key challenges in capturing sufficiently good data with no tracking devices - starting from sensor selection and accurate 6DoF pose estimation to natural language processing. In particular, we focus on two showcases: gluing task with a glue gun and simple block-stacking with variable blocks. Furthermore, we discuss how a linguistic description of the task could help to improve the accuracy of task description. We also present the whole architecture of our transfer of the imitated task to the simulated and real robot environment.

Submitted 24 January, 2019; originally announced January 2019.

Journal ref: The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018, Workshop on: Towards Intelligent Social Robots: From Naive Robots to Robot Sapiens http://intelligent-social-robots-ws.com/materials/

arXiv:1805.07263

doi 10.1109/LRA.2019.2898320

Robot self-calibration using multiple kinematic chains -- a simulation study on the iCub humanoid robot

Authors: Karla Stepanova, Tomas Pajdla, Matej Hoffmann

Abstract: Mechanism calibration is an important and non-trivial task in robotics. Advances in sensor technology make affordable but increasingly accurate devices such as cameras and tactile sensors available, making it possible to perform automated self-contained calibration relying on redundant information in these sensory streams. In this work, we use a simulated iCub humanoid robot with a stereo camera system and end-effector contact emulation to quantitatively compare the performance of kinematic calibration by employing different combinations of intersecting kinematic chains -- either through self-observation or self-touch. The parameters varied were: (i) type and number of intersecting kinematic chains used for calibration, (ii) parameters and chains subject to optimization, (iii) amount of initial perturbation of kinematic parameters, (iv) number of poses/configurations used for optimization, (v) amount of measurement noise in end-effector positions / cameras. The main findings are: (1) calibrating parameters of a single chain (e.g. one arm) by employing multiple kinematic chains ("self-observation" and "self-touch") is superior in terms of optimization results as well as observability; (2) when using multi-chain calibration, fewer poses suffice to get similar performance compared to when for example only observation from a single camera is used; (3) parameters of all chains (here 86 DH parameters) can be subject to calibration simultaneously and with 50 (100) poses, end-effector error of around 2 (1) mm can be achieved; (4) adding noise to a sensory modality degrades performance of all calibrations employing the chains relying on this information.

Submitted 31 August, 2020; v1 submitted 18 May, 2018; originally announced May 2018.

Comments: 8 pages; 8 figures; substantially revised version compared to previous - all data and results are new

MSC Class: 68T40 Robotics

Journal ref: IEEE Robotics and Automation Letters 4(2), 1900-1907 (2019)

arXiv:1706.02490

Where is my forearm? Clustering of body parts from simultaneous tactile and linguistic input using sequential mapping

Authors: Karla Stepanova, Matej Hoffmann, Zdenek Straka, Frederico B. Klein, Angelo Cangelosi, Michal Vavrecka

Abstract: Humans and animals are constantly exposed to a continuous stream of sensory information from different modalities. At the same time, they form more compressed representations like concepts or symbols. In species that use language, this process is further structured by this interaction, where a mapping between the sensorimotor concepts and linguistic elements needs to be established. There is evidence that children might be learning language by simply disambiguating potential meanings based on multiple exposures to utterances in different contexts (cross-situational learning). In existing models, the mapping between modalities is usually found in a single step by directly using frequencies of referent and meaning co-occurrences. In this paper, we present an extension of this one-step mapping and introduce a newly proposed sequential mapping algorithm together with a publicly available Matlab implementation. For demonstration, we have chosen a less typical scenario: instead of learning to associate objects with their names, we focus on body representations. A humanoid robot is receiving tactile stimulations on its body, while at the same time listening to utterances of the body part names (e.g., hand, forearm and torso). With the goal of arriving at the correct "body categories", we demonstrate how a sequential mapping algorithm outperforms one-step mapping. In addition, the effect of data set size and noise in the linguistic input are studied.

Submitted 8 June, 2017; originally announced June 2017.

Comments: pp. 155-162
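
The one-step baseline described in the abstract above (assigning each referent the word it co-occurs with most often) can be sketched as follows. This is only an illustration of frequency-based cross-situational mapping under simple assumptions; it is not the paper's sequential mapping algorithm, whose details are not given in this listing. The function name `one_step_mapping` and the cluster labels in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def one_step_mapping(observations):
    """Map each referent to its most frequently co-occurring word.

    `observations` is an iterable of (referent, word) pairs, e.g. a tactile
    cluster label paired with the word heard at the same time.
    """
    counts = defaultdict(Counter)
    for referent, word in observations:
        counts[referent][word] += 1  # accumulate co-occurrence frequencies
    # Pick the highest-count word for every referent in a single pass.
    return {ref: c.most_common(1)[0][0] for ref, c in counts.items()}
```

For example, if a hypothetical tactile cluster `cluster_a` is heard with "hand" three times and with "torso" once, the sketch maps `cluster_a` to "hand", so occasional noisy utterances are outvoted by the dominant pairing.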

Showing 1–13 of 13 results for author: Stepanova, K