-
Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models
Authors:
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior arts fix their skills library and steer the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work, we introduce LRLL, an LLM-based lifelong learning agent that continuously grows the robot skill library to tackle manipulation tasks of ever-growing complexity. LRLL achieves this with four novel contributions: 1) a soft memory module that allows dynamic storage and retrieval of past experiences to serve as context, 2) a self-guided exploration policy that proposes new tasks in simulation, 3) a skill abstractor that distills recent experiences into new library skills, and 4) a lifelong learning algorithm for enabling human users to bootstrap new skills with minimal online interaction. LRLL continuously transfers knowledge from the memory to the library, building composable, general and interpretable policies, while bypassing gradient-based optimization, thus relieving the learner from catastrophic forgetting. Empirical evaluation in a simulated tabletop environment shows that LRLL outperforms end-to-end and vanilla LLM approaches in the lifelong setup while learning skills that are transferable to the real world. Project material will become available at the webpage https://gtziafas.github.io/LRLL_project.
Submitted 26 June, 2024;
originally announced June 2024.
-
3D Feature Distillation with Object-Centric Priors
Authors:
Georgios Tziafas,
Yucheng Xu,
Zhibin Li,
Hamidreza Kasaei
Abstract:
Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
Submitted 1 July, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Towards Open-World Grasping with Large Vision-Language Models
Authors:
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.
Submitted 1 July, 2024; v1 submitted 26 June, 2024;
originally announced June 2024.
-
Harnessing the Synergy between Pushing, Grasping, and Throwing to Enhance Object Manipulation in Cluttered Scenarios
Authors:
Hamidreza Kasaei,
Mohammadreza Kasaei
Abstract:
In this work, we delve into the intricate synergy among non-prehensile actions like pushing, and prehensile actions such as grasping and throwing, within the domain of robotic manipulation. We introduce an innovative approach to learning these synergies by leveraging model-free deep reinforcement learning. The robot's workflow involves detecting the pose of the target object and the basket at each time step, predicting the optimal push configuration to isolate the target object, determining the appropriate grasp configuration, and inferring the necessary parameters for an accurate throw into the basket. This empowers robots to skillfully reconfigure cluttered scenarios through pushing, creating space for collision-free grasping actions. Simultaneously, we integrate throwing behavior, showcasing how this action significantly extends the robot's operational reach. Ensuring safety, we developed a simulation environment in Gazebo for robot training, applying the learned policy directly to our real robot. Notably, this work represents a pioneering effort to learn the synergy between pushing, grasping, and throwing actions. Extensive experimentation in both simulated and real-robot scenarios substantiates the effectiveness of our approach across diverse settings. Our approach achieves a success rate exceeding 80% in both simulated and real-world scenarios. A video showcasing our experiments is available online at: https://youtu.be/q1l4BJVDbRw
Submitted 25 February, 2024;
originally announced February 2024.
-
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter
Authors:
Georgios Tziafas,
Yucheng Xu,
Arushi Goel,
Mohammadreza Kasaei,
Zhibin Li,
Hamidreza Kasaei
Abstract:
Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from the OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.
Submitted 9 November, 2023;
originally announced November 2023.
-
Anchor Space Optimal Transport: Accelerating Batch Processing of Multiple OT Problems
Authors:
Jianming Huang,
Xun Su,
Zhongxi Fang,
Hiroyuki Kasai
Abstract:
The optimal transport (OT) theory provides an effective way to compare probability distributions on a defined metric space, but it suffers from cubic computational complexity. Although Sinkhorn's algorithm greatly reduces the computational complexity of OT solutions, solving multiple OT problems is still time-consuming and memory-consuming in practice. However, many works on the computational acceleration of OT are usually based on the premise of a single OT problem, ignoring the potential common characteristics of the distributions in a mini-batch. Therefore, we propose a translated OT problem designated as the anchor space optimal transport (ASOT) problem, which is specially designed for batch processing of multiple OT problem solutions. For the proposed ASOT problem, the distributions will be mapped into a shared anchor point space, which learns the potential common characteristics and thus helps accelerate OT batch processing. Based on the proposed ASOT, the Wasserstein distance error to the original OT problem is proven to be bounded by ground cost errors. Building upon this, we propose three methods to learn an anchor space minimizing the distance error, each of which has its application background. Numerical experiments on real-world datasets show that our proposed methods can greatly reduce computational time while maintaining reasonable approximation performance.
Submitted 24 October, 2023;
originally announced October 2023.
-
Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models
Authors:
Bangguo Yu,
Hamidreza Kasaei,
Ming Cao
Abstract:
In advanced human-robot interaction tasks, visual target navigation is crucial for autonomous robots navigating unknown environments. While numerous approaches have been developed in the past, most are designed for single-robot operations, which often suffer from reduced efficiency and robustness due to environmental complexities. Furthermore, learning policies for multi-robot collaboration are resource-intensive. To address these challenges, we propose Co-NavGPT, an innovative framework that integrates Large Language Models (LLMs) as a global planner for multi-robot cooperative visual target navigation. Co-NavGPT encodes the explored environment data into prompts, enhancing LLMs' scene comprehension. It then assigns exploration frontiers to each robot for efficient target search. Experimental results on Habitat-Matterport 3D (HM3D) demonstrate that Co-NavGPT surpasses existing models in success rates and efficiency without any learning process, demonstrating the vast potential of LLMs in multi-robot collaboration domains. The supplementary video, prompts, and code can be accessed via the following link: https://sites.google.com/view/co-navgpt
Submitted 25 December, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Safe Screening for Unbalanced Optimal Transport
Authors:
Xun Su,
Zhongxi Fang,
Hiroyuki Kasai
Abstract:
This paper introduces a framework that utilizes the Safe Screening technique to accelerate the optimization process of the Unbalanced Optimal Transport (UOT) problem by proactively identifying and eliminating zero elements in the sparse solutions. We demonstrate the feasibility of applying Safe Screening to the UOT problem with $\ell_2$-penalty and KL-penalty by conducting an analysis of the solution's bounds and considering the local strong convexity of the dual problem. Considering the specific structural characteristics of the UOT in comparison to general Lasso problems on the index matrix, we specifically propose a novel approximate projection, an elliptical safe region construction, and a two-hyperplane relaxation method. These enhancements significantly improve the screening efficiency for the UOT problem without altering the algorithm's complexity.
Submitted 1 July, 2023;
originally announced July 2023.
-
Fine-grained 3D object recognition: an approach and experiments
Authors:
Junhyung Jo,
Hamidreza Kasaei
Abstract:
Three-dimensional (3D) object recognition is a core technology in advanced applications such as autonomous driving. There are two sets of approaches for 3D object recognition: (i) hand-crafted approaches like Global Orthographic Object Descriptor (GOOD), and (ii) deep learning-based approaches such as MobileNet and VGG. However, it is necessary to determine which of these approaches works better in an open-ended domain where the number of known categories increases over time, and the system should learn about new object categories using few training examples. In this paper, we first implemented an offline 3D object recognition system that takes an object view as input and generates category labels as output. In the offline stage, instance-based learning (IBL) is used to form a new category and we use K-fold cross-validation to evaluate the obtained object recognition performance. We then test the proposed approach in an online fashion by integrating the code into a simulated teacher test. As a result, we concluded that the approach using deep learning features is more suitable for open-ended settings. Moreover, we observed that concatenating the hand-crafted and deep learning features increases the classification accuracy.
Submitted 28 June, 2023;
originally announced June 2023.
-
Anonymous estimation of intensity distribution of magnetic fields with quantum sensing network
Authors:
Hiroto Kasai,
Yuki Takeuchi,
Yuichiro Matsuzaki,
Yasuhiro Tokura
Abstract:
A quantum sensing network is used to simultaneously detect and measure physical quantities, such as magnetic fields, at different locations. However, there is a risk that the measurement data is leaked to a third party during the communication. Many theoretical and experimental efforts have been made to realize a secure quantum sensing network where a high level of security is guaranteed. In this paper, we propose a protocol to estimate statistical quantities of the target fields at different places without knowing the individual values of the target fields. We generate entanglement between $L$ quantum sensors, let the quantum sensors interact with local fields, and perform specific measurements on them. By calculating the quantum Fisher information to estimate the individual values of the magnetic fields, we show that we cannot obtain any information about the values of the individual fields in the limit of large $L$. On the other hand, in our protocol, we can theoretically estimate any moment of the field distribution by measuring a specific observable, and we evaluate the relative uncertainty of the $k$-th ($k=1,2,3,4$) order moment. Our results are a significant step towards using a quantum sensing network with security inbuilt.
Submitted 23 May, 2023;
originally announced May 2023.
-
Learning Perceptive Bipedal Locomotion over Irregular Terrain
Authors:
Bart van Marum,
Matthia Sabatelli,
Hamidreza Kasaei
Abstract:
In this paper we propose a novel bipedal locomotion controller that uses noisy exteroception to traverse a wide variety of terrains. Building on the cutting-edge advancements in attention based belief encoding for quadrupedal locomotion, our work extends these methods to the bipedal domain, resulting in a robust and reliable internal belief of the terrain ahead despite noisy sensor inputs. Additionally, we present a reward function that allows the controller to successfully traverse irregular terrain. We compare our method with a proprioceptive baseline and show that our method is able to traverse a wide variety of terrains and greatly outperforms the state-of-the-art in terms of robustness, speed and efficiency.
Submitted 14 April, 2023;
originally announced April 2023.
-
Frontier Semantic Exploration for Visual Target Navigation
Authors:
Bangguo Yu,
Hamidreza Kasaei,
Ming Cao
Abstract:
This work focuses on the problem of visual target navigation, which is very important for autonomous robots as it is closely related to high-level tasks. To find a special object in unknown environments, classical and learning-based approaches are fundamental components of navigation that have been investigated thoroughly in the past. However, due to the difficulty in the representation of complicated scenes and the learning of the navigation policy, previous methods are still not adequate, especially for large unknown scenes. Hence, we propose a novel framework for visual target navigation using the frontier semantic policy. In this proposed framework, the semantic map and the frontier map are built from the current observation of the environment. Using the features of the maps and object category, deep reinforcement learning is used to learn a frontier semantic policy, which selects a frontier cell as a long-term goal to explore the environment efficiently. Experiments on Gibson and Habitat-Matterport 3D (HM3D) demonstrate that the proposed framework significantly outperforms existing map-based methods in terms of success rate and efficiency. Ablation analysis also indicates that the proposed approach learns a more efficient exploration policy based on the frontiers. A demonstration is provided to verify that our model can be transferred to the real world. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/fsevn.
Submitted 25 December, 2023; v1 submitted 11 April, 2023;
originally announced April 2023.
-
L3MVN: Leveraging Large Language Models for Visual Target Navigation
Authors:
Bangguo Yu,
Hamidreza Kasaei,
Ming Cao
Abstract:
Visual target navigation in unknown environments is a crucial problem in robotics. Despite extensive investigation of classical and learning-based approaches in the past, robots lack common-sense knowledge about household objects and layouts. Prior state-of-the-art approaches to this task rely on learning the priors during the training and typically require significant expensive resources and time for learning. To address this, we propose a new framework for visual target navigation that leverages Large Language Models (LLM) to impart common sense for object searching. Specifically, we introduce two paradigms: (i) zero-shot and (ii) feed-forward approaches that use language to find the relevant frontier from the semantic map as a long-term goal and explore the environment efficiently. Our analysis demonstrates the notable zero-shot generalization and transfer capabilities from the use of language. Experiments on Gibson and Habitat-Matterport 3D (HM3D) demonstrate that the proposed framework significantly outperforms existing map-based methods in terms of success rate and generalization. Ablation analysis also indicates that the common-sense knowledge from the language model leads to more efficient semantic exploration. Finally, we provide a real robot experiment to verify the applicability of our framework in real-world scenarios. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/l3mvn.
Submitted 25 December, 2023; v1 submitted 11 April, 2023;
originally announced April 2023.
-
Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE
Authors:
Yucheng Xu,
Li Nanbo,
Arushi Goel,
Zijian Guo,
Zonghai Yao,
Hamidreza Kasaei,
Mohammadreze Kasaei,
Zhibin Li
Abstract:
Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations (Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes.
Submitted 4 April, 2023; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Instance-wise Grasp Synthesis for Robotic Grasping
Authors:
Yucheng Xu,
Mohammadreza Kasaei,
Hamidreza Kasaei,
Zhibin Li
Abstract:
Generating high-quality instance-wise grasp configurations provides critical information of how to grasp specific objects in a multi-object environment and is of high importance for robot manipulation tasks. This work proposed a novel Single-Stage Grasp (SSG) synthesis network, which performs high-quality instance-wise grasp synthesis in a single stage: instance mask and grasp configurations are generated for each object simultaneously. Our method outperforms state-of-the-art on robotic grasp prediction based on the OCID-Grasp dataset, and performs competitively on the JACQUARD dataset. The benchmarking results showed significant improvements compared to the baseline on the accuracy of generated grasp configurations. The performance of the proposed method has been validated through both extensive simulations and real robot experiments for three tasks including single object pick-and-place, grasp synthesis in cluttered environments and table cleaning task.
Submitted 15 February, 2023;
originally announced February 2023.
-
Explain What You See: Open-Ended Segmentation and Recognition of Occluded 3D Objects
Authors:
H. Ayoobi,
H. Kasaei,
M. Cao,
R. Verbrugge,
B. Verheij
Abstract:
Local-HDP (for Local Hierarchical Dirichlet Process) is a hierarchical Bayesian method that has recently been used for open-ended 3D object category recognition. This method has been proven to be efficient in real-time robotic applications. However, the method is not robust to a high degree of occlusion. We address this limitation in two steps. First, we propose a novel semantic 3D object-parts segmentation method that has the flexibility of Local-HDP. This method is shown to be suitable for open-ended scenarios where the number of 3D objects or object parts is not fixed and can grow over time. We show that the proposed method has a higher percentage of mean intersection over union, using a smaller number of learning instances. Second, we integrate this technique with a recently introduced argumentation-based online incremental learning method, thereby enabling the model to handle a high degree of occlusion. We show that the resulting model produces an explicit set of explanations for the 3D object category recognition task.
Submitted 17 January, 2023;
originally announced January 2023.
-
Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models
Authors:
Songsong Xiong,
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50 % and 93.51 % on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we successfully integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
Submitted 6 March, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
GraspCaps: A Capsule Network Approach for Familiar 6DoF Object Grasping
Authors:
Tomas van der Velde,
Hamed Ayoobi,
Hamidreza Kasaei
Abstract:
As robots become more widely available outside industrial settings, the need for reliable object grasping and manipulation is increasing. In such environments, robots must be able to grasp and manipulate novel objects in various situations. This paper presents GraspCaps, a novel architecture based on Capsule Networks for generating per-point 6D grasp configurations for familiar objects. GraspCaps extracts a rich feature vector of the objects present in the point cloud input, which is then used to generate per-point grasp vectors. This approach allows the network to learn specific grasping strategies for each object category. In addition to GraspCaps, the paper also presents a method for generating a large object-grasping dataset using simulated annealing. The obtained dataset is then used to train the GraspCaps network. Through extensive experiments, we evaluate the performance of the proposed approach, particularly in terms of the success rate of grasping familiar objects in challenging real and simulated scenarios. The experimental results showed that the overall object-grasping performance of the proposed approach is significantly better than the selected baseline. This superior performance highlights the effectiveness of the GraspCaps in achieving successful object grasping across various scenarios.
Submitted 29 November, 2023; v1 submitted 7 October, 2022;
originally announced October 2022.
-
Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
Authors:
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
In this paper, we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), therefore marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
Submitted 7 May, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition
Authors:
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
The Vision Transformer (ViT) architecture has established its place in computer vision literature; however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to align RGB with 3D information. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs in RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. Compared to previous works in multimodal Transformers, the key challenge here is to use the attested flexibility of ViTs to capture cross-modal interactions at the downstream and not the pretraining stage. We explore which depth representation is better in terms of resulting accuracy and compare early and late fusion techniques for aligning the RGB and depth modalities within the ViT architecture. Experimental results in the Washington RGB-D Objects dataset (ROD) demonstrate that in such RGB -> RGB-D scenarios, late fusion techniques work better than the most popularly employed early fusion techniques. With our transfer baseline, fusion ViTs score up to 95.4% top-1 accuracy in ROD, achieving new state-of-the-art results in this benchmark. We further show the benefits of using our multimodal fusion baseline over unimodal feature extractors in a synthetic-to-real visual adaptation as well as in an open-ended lifelong learning scenario in the ROD benchmark, where our model outperforms previous works by a margin of >8%. Finally, we integrate our method with a robot framework and demonstrate how it can serve as a perception utility in an interactive robot learning scenario, both in simulation and with a real robot.
Submitted 7 March, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
IPPO: Obstacle Avoidance for Robotic Manipulators in Joint Space via Improved Proximal Policy Optimization
Authors:
Yongliang Wang,
Hamidreza Kasaei
Abstract:
Reaching tasks with random targets and obstacles are challenging for robotic manipulators. In this study, we propose a novel model-free reinforcement learning approach based on proximal policy optimization (PPO) for training a deep policy to map the task space to the joint space of a 6-DoF manipulator. To facilitate the training process in a large workspace, we develop an efficient representation of environmental inputs and outputs. The calculation of the distance between obstacles and manipulator links is incorporated into the state representation using a geometry-based method. Additionally, to enhance the performance of the model in reaching tasks, we introduce the action ensembles method and design the policy to directly participate in value function updates in PPO. To overcome the challenges associated with training in real-robot environments, we develop a simulation environment in Gazebo to train the model, as it produces a smaller Sim-to-Real gap compared to other simulators. However, training in Gazebo is time-intensive. To address this issue, we propose a Sim-to-Sim method to significantly reduce the training time. The trained model is then directly applied in a real-robot setup without fine-tuning. To evaluate the performance of the proposed approach, we perform several rounds of experiments in both simulated and real-robot settings. We also compare the performance of the proposed approach with six baselines. The experimental results demonstrate the effectiveness of the proposed method in performing reaching tasks with and without obstacles. Our method outperformed the selected baselines by a large margin in different reaching task scenarios. A video of these experiments has been attached to the paper as supplementary material.
Submitted 9 February, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Throwing Objects into A Moving Basket While Avoiding Obstacles
Authors:
Hamidreza Kasaei,
Mohammadreza Kasaei
Abstract:
The capabilities of a robot will be increased significantly by exploiting throwing behavior. In particular, throwing will enable robots to rapidly place the object into the target basket, located outside its feasible kinematic space, without traveling to the desired location. In previous approaches, the robot often learned a parameterized throwing kernel through analytical approaches, imitation learning, or hand-coding. There are many situations in which such approaches do not work or generalize well due to various object shapes, heterogeneous mass distribution, and obstacles that might be present in the environment. It is obvious that a method is needed to modulate the throwing kernel through its meta parameters. In this paper, we tackle the object throwing problem through a deep reinforcement learning approach that enables robots to precisely throw objects into moving baskets while there are obstacles obstructing the path. To the best of our knowledge, we are the first group that addresses throwing objects with obstacle avoidance. Such a throwing skill not only increases the physical reachability of a robot arm but also improves the execution time. In particular, the robot detects the pose of the target object, basket, and obstacle at each time step, predicts the proper grasp configuration for the target object, and then infers appropriate parameters to throw the object into the basket. Due to safety constraints, we develop a simulation environment in Gazebo to train the robot and then use the learned policy directly on a real robot. To assess the performance of the proposed approach, we perform extensive sets of experiments in both simulation and on real robots in three scenarios. Experimental results showed that the robot could precisely throw a target object into the basket outside its kinematic range and generalize well to new locations and objects without colliding with obstacles.
Submitted 2 October, 2022;
originally announced October 2022.
-
Wasserstein Graph Distance Based on $L_1$-Approximated Tree Edit Distance between Weisfeiler-Lehman Subtrees
Authors:
Zhongxi Fang,
Jianming Huang,
Xun Su,
Hiroyuki Kasai
Abstract:
The Weisfeiler-Lehman (WL) test is a widely used algorithm in graph machine learning, including graph kernels, graph metrics, and graph neural networks. However, it focuses only on the consistency of the graph, which means that it is unable to detect slight structural differences. Consequently, this limits its ability to capture structural information, which also limits the performance of existing models that rely on the WL test. This limitation is particularly severe for traditional metrics defined by the WL test, which cannot precisely capture slight structural differences. In this paper, we propose a novel graph metric called the Wasserstein WL Subtree (WWLS) distance to address this problem. Our approach leverages the WL subtree as structural information for node neighborhoods and defines node metrics using the $L_1$-approximated tree edit distance ($L_1$-TED) between WL subtrees of nodes. Subsequently, we combine the Wasserstein distance and the $L_1$-TED to define the WWLS distance, which can capture slight structural differences that may be difficult to detect using conventional metrics. We demonstrate that the proposed WWLS distance outperforms baselines in both metric validation and graph classification experiments.
Submitted 1 May, 2023; v1 submitted 9 July, 2022;
originally announced July 2022.
-
On the Convergence of Semi-Relaxed Sinkhorn with Marginal Constraint and OT Distance Gaps
Authors:
Takumi Fukunaga,
Hiroyuki Kasai
Abstract:
This paper presents consideration of the Semi-Relaxed Sinkhorn (SR-Sinkhorn) algorithm for the semi-relaxed optimal transport (SROT) problem, which relaxes one marginal constraint of the standard OT problem. For evaluation of how the constraint relaxation affects the algorithm behavior and solution, it is vitally necessary to present the theoretical convergence analysis in terms not only of the functional value gap, but also of the marginal constraint gap as well as the OT distance gap. However, no existing work has addressed all analyses simultaneously. To this end, this paper presents a comprehensive convergence analysis for SR-Sinkhorn. After presenting the $ε$-approximation of the functional value gap based on a new proof strategy and exploiting this proof strategy, we give the upper bound of the marginal constraint gap. We also provide its convergence to the $ε$-approximation when two distributions are in the probability simplex. Furthermore, the convergence analysis of the OT distance gap to the $ε$-approximation is given as assisted by the obtained marginal constraint gap. The latter two theoretical results are the first results presented in the literature related to the SROT problem.
Submitted 27 May, 2022;
originally announced May 2022.
-
Block-coordinate Frank-Wolfe algorithm and convergence analysis for semi-relaxed optimal transport problem
Authors:
Takumi Fukunaga,
Hiroyuki Kasai
Abstract:
The optimal transport (OT) problem has been used widely for machine learning. It is necessary for computation of an OT problem to solve linear programming with tight mass-conservation constraints. These constraints prevent its application to large-scale problems. To address this issue, loosening such constraints enables us to propose the relaxed-OT method using a faster algorithm. This approach has demonstrated its effectiveness for applications. However, it remains slow. As a superior alternative, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm for a convex semi-relaxed OT. Specifically, we prove their upper bounds of the worst convergence iterations, and equivalence between the linearization duality gap and the Lagrangian duality gap. Additionally, we develop two fast variants of the proposed BCFW. Numerical experiments have demonstrated that our proposed algorithms are effective for color transfer and surpass state-of-the-art algorithms. This report presents a short version of arXiv:2103.05857.
Submitted 27 May, 2022;
originally announced May 2022.
-
Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution
Authors:
Georgios Tziafas,
Hamidreza Kasaei
Abstract:
Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, therefore relying heavily on large datasets. Additionally, their transfer performance in RGB-D datasets suffers due to high visual discrepancy between the benchmark and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications.
Submitted 10 July, 2022; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition
Authors:
Hamidreza Kasaei,
Songsong Xiong
Abstract:
Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working in the environment and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline, and open-ended scenarios. For the evaluation purpose, in addition to real object datasets, we generate a large synthetic household objects dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on online few-shot 3D object recognition tasks, as well as its superior performance over the state-of-the-art open-ended learning approaches. Furthermore, our results show that while ensemble learning is modestly beneficial in offline settings, it is significantly beneficial in lifelong few-shot learning situations. Additionally, we demonstrated the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learned new categories from limited examples.
Submitted 9 January, 2024; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Self-Supervised Learning for Joint Pushing and Grasping Policies in Highly Cluttered Environments
Authors:
Yongliang Wang,
Kamal Mokhtar,
Cock Heemskerk,
Hamidreza Kasaei
Abstract:
Robots often face situations where grasping a goal object is desirable but not feasible due to other present objects preventing the grasp action. To address this problem, we present a deep Reinforcement Learning approach to learn grasping and pushing policies for manipulating a goal object in highly cluttered environments. In particular, a dual Reinforcement Learning model approach is proposed, which presents high resilience in handling complicated scenes, reaching an average of 98% task completion using primitive objects in a simulation environment. To evaluate the performance of the proposed approach, we performed two extensive sets of experiments in packed-object and pile-of-objects scenarios with a total of 1000 test runs in simulation. Experimental results showed that the proposed method worked very well in both scenarios and outperformed the recent state-of-the-art approaches. A demo video, trained models, and source code for reproducing the results are publicly available at https://sites.google.com/view/pushandgrasp/home
Submitted 16 March, 2024; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory Recurrent Self-Organization Networks
Authors:
Krishnakumar Santhakumar,
Hamidreza Kasaei
Abstract:
Humans learn to recognize and manipulate new objects in lifelong settings without forgetting the previously gained knowledge under non-stationary and sequential conditions. In autonomous systems, the agents also need to exhibit similar behavior to continually learn new object categories and adapt to new environments. In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting, where the newly gained knowledge overwrites existing representations. Furthermore, most state-of-the-art models excel either in recognizing the objects or in grasp prediction, even though both tasks use visual input. Combined architectures that tackle both tasks are very limited. In this paper, we propose a hybrid model architecture consisting of a dynamically growing dual-memory recurrent neural network (GDM) and an autoencoder to tackle object recognition and grasping simultaneously. The autoencoder network is responsible for extracting a compact representation for a given object, which serves as input for the GDM learning, and for predicting pixel-wise antipodal grasp configurations. The GDM part is designed to recognize the object at both the instance and category levels. We address the problem of catastrophic forgetting using intrinsic memory replay, where the episodic memory periodically replays the neural activation trajectories in the absence of external sensory information. To extensively evaluate the proposed model in a lifelong setting, we generate a synthetic dataset due to the lack of a sequential 3D object dataset. Experimental results demonstrated that the proposed model can learn both object representation and grasping simultaneously in continual learning scenarios.
Submitted 23 January, 2022; v1 submitted 23 September, 2021;
originally announced September 2021.
-
Simultaneous Multi-View Object Recognition and Grasping in Open-Ended Domains
Authors:
Hamidreza Kasaei,
Sha Luo,
Remo Sasso,
Mohammadreza Kasaei
Abstract:
To aid humans in everyday tasks, robots need to know which objects exist in the scene, where they are, and how to grasp and manipulate them in different situations. Therefore, object recognition and grasping are two key functionalities for autonomous robots. Most state-of-the-art approaches treat object recognition and grasping as two separate problems, even though both use visual input. Furthermore, the knowledge of the robot is fixed after the training phase. In such cases, if the robot encounters new object categories, it must be retrained to incorporate new information without catastrophic forgetting. In order to resolve this problem, we propose a deep learning architecture with an augmented memory capacity to handle open-ended object recognition and grasping simultaneously. In particular, our approach takes multi-views of an object as input and jointly estimates pixel-wise grasp configuration as well as a deep scale- and rotation-invariant representation as output. The obtained representation is then used for open-ended object recognition through a meta-active learning technique. We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories using very few examples on-site in both simulation and real-world settings. A video of these experiments is available online at: https://youtu.be/n9SMpuEkOgk
Submitted 6 December, 2022; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Anonymous quantum sensing
Authors:
Hiroto Kasai,
Yuki Takeuchi,
Hideaki Hakoshima,
Yuichiro Matsuzaki,
Yasuhiro Tokura
Abstract:
A lot of attention has been paid to quantum-sensing networks for detecting magnetic fields in different positions. Recently, cryptographic quantum metrology was investigated, where the information of the magnetic fields is transmitted in a secure way. However, sometimes, the positions where non-zero magnetic fields are generated could carry important information. Here, we propose an anonymous quantum sensor where the information about positions having non-zero magnetic fields is hidden after measuring magnetic fields with a quantum-sensing network. Suppose that agents are located in different positions and they have quantum sensors. After the quantum sensors are entangled, the agents implement quantum sensing that provides phase information if non-zero magnetic fields exist, and a POVM measurement is performed on the quantum sensors. Importantly, even if the outcomes of the POVM measurement are stolen by an eavesdropper, information about the positions with non-zero magnetic fields is still unknown to the eavesdropper in our protocol. In addition, we evaluate the sensitivity of our proposed quantum sensors by using Fisher information when there are at most two positions having non-zero magnetic fields. We show that the sensitivity is finite unless these two (non-zero) magnetic fields have exactly the same amplitude. Our results pave the way for new applications of quantum-sensing networks.
Submitted 12 May, 2021;
originally announced May 2021.
-
Self-Imitation Learning by Planning
Authors:
Sha Luo,
Hamidreza Kasaei,
Lambert Schomaker
Abstract:
Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, which is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we solve this problem using our proposed approach called self-imitation learning by planning (SILP), where demonstration data are collected automatically by planning on the visited states from the current policy. SILP is inspired by the observation that successfully visited states in the early reinforcement learning stage are collision-free nodes in the graph-search based motion planner, so we can plan and relabel the robot's own trials as demonstrations for policy learning. Due to these self-generated demonstrations, we relieve the human operator from the laborious data preparation process required by IL and RL methods in solving complex motion planning tasks. The evaluation results show that our SILP method achieves higher success rates and enhances sample efficiency compared to selected baselines, and the policy learned in simulation performs well in a real-world placement task with changing goals and obstacles.
Submitted 26 March, 2021; v1 submitted 25 March, 2021;
originally announced March 2021.
-
MVGrasp: Real-Time Multi-View 3D Object Grasping in Highly Cluttered Environments
Authors:
Hamidreza Kasaei,
Mohammadreza Kasaei
Abstract:
Nowadays robots play an increasingly important role in our daily life. In human-centered environments, robots often encounter piles of objects, packed items, or isolated objects. Therefore, a robot must be able to grasp and manipulate different objects in various situations to help humans with daily tasks. In this paper, we propose a multi-view deep learning approach to handle robust object grasping in human-centric domains. In particular, our approach takes a point cloud of an arbitrary object as an input, and then generates orthographic views of the given object. The obtained views are finally used to estimate pixel-wise grasp synthesis for each object. We train the model end-to-end using a small object grasp dataset and test it on both simulations and real-world data without any further fine-tuning. To evaluate the performance of the proposed approach, we performed extensive sets of experiments in three scenarios, including isolated objects, packed items, and piles of objects. Experimental results show that our approach performed very well in all simulation and real-robot scenarios, and is able to achieve reliable closed-loop grasping of novel objects across various scene configurations.
Submitted 5 October, 2022; v1 submitted 19 March, 2021;
originally announced March 2021.
-
MORE: Simultaneous Multi-View 3D Object Recognition and Pose Estimation
Authors:
Tommaso Parisotto,
Subhaditya Mukherjee,
Hamidreza Kasaei
Abstract:
Simultaneous object recognition and pose estimation are two key functionalities for robots to safely interact with humans as well as environments. Although both object recognition and pose estimation use visual input, most state-of-the-art approaches tackle them as two separate problems since the former needs a view-invariant representation while object pose estimation necessitates a view-dependent description. Nowadays, multi-view Convolutional Neural Network (MVCNN) approaches show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation methods, and even less on addressing these two problems simultaneously. The pose of virtual cameras in MVCNN methods is often predefined, which limits the application of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep object-agnostic entropy estimation model, capable of predicting the best viewpoints of a given 3D object. The obtained views of the object are then fed to the network to simultaneously predict the pose and category label of the target object. Experimental results showed that the views obtained from such positions are descriptive enough to achieve a good accuracy score. Furthermore, we designed a real-life serve drink scenario to demonstrate how well the proposed approach worked in real robot tasks. Code is available online at: github.com/SubhadityaMukherjee/more_mvcnn
Submitted 7 April, 2023; v1 submitted 17 March, 2021;
originally announced March 2021.
-
Few-Shot Visual Grounding for Natural Human-Robot Interaction
Authors:
Giorgos Tziafas,
Hamidreza Kasaei
Abstract:
Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods that tackle the challenge using pre-trained object detectors via a two-stepped process, we develop a single stage zero-shot model that is able to provide predictions in unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results showed that the proposed model performs well in terms of accuracy and speed, while showcasing robustness to variation in the natural language input.
Submitted 31 March, 2021; v1 submitted 17 March, 2021;
originally announced March 2021.
-
Fast block-coordinate Frank-Wolfe algorithm for semi-relaxed optimal transport
Authors:
Takumi Fukunaga,
Hiroyuki Kasai
Abstract:
Optimal transport (OT), which provides a distance between two probability distributions by considering their spatial locations, has been applied to widely diverse applications. Computing an OT problem requires solution of linear programming with tight mass-conservation constraints. This requirement hinders its application to large-scale problems. To alleviate this issue, the recently proposed relaxed-OT approach uses a faster algorithm by relaxing such constraints. Its effectiveness for practical applications has been demonstrated. Nevertheless, it still exhibits slow convergence. To this end, addressing a convex semi-relaxed OT, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm, which gives sparse solutions. Specifically, we provide their upper bounds of the worst convergence iterations, and equivalence between the linearization duality gap and the Lagrangian duality gap. Three fast variants of the proposed BCFW are also proposed. Numerical evaluations in color transfer problem demonstrate that the proposed algorithms outperform state-of-the-art algorithms across different settings.
Submitted 9 March, 2021;
originally announced March 2021.
-
Manifold optimization for non-linear optimal transport problems
Authors:
Bamdev Mishra,
N T V Satyadev,
Hiroyuki Kasai,
Pratik Jawanpuria
Abstract:
Optimal transport (OT) has recently found widespread interest in machine learning. It allows defining novel distances between probability measures, which have shown promise in several applications. In this work, we discuss how to computationally approach general non-linear OT problems within the framework of Riemannian manifold optimization. The basis of this is the manifold of doubly stochastic matrices (and their generalization). Even though the manifold geometry is not new, surprisingly, its usefulness for solving general non-linear OT problems has not been popular. To this end, we specifically discuss optimization-related ingredients that allow modeling the OT problem on smooth Riemannian manifolds by exploiting the geometry of the search space. We also discuss extensions where we reuse the developed optimization ingredients. We make available the Manifold optimization-based Optimal Transport, or MOT, repository with codes useful in solving OT problems in Python and Matlab. The codes are available at https://github.com/SatyadevNtv/MOT.
Submitted 8 October, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
LCS Graph Kernel Based on Wasserstein Distance in Longest Common Subsequence Metric Space
Authors:
Jianming Huang,
Zhongxi Fang,
Hiroyuki Kasai
Abstract:
For graph learning tasks, many existing methods utilize a message-passing mechanism where vertex features are updated iteratively by aggregation of neighbor information. This strategy provides an efficient means for graph features extraction, but obtained features after many iterations might contain too much information from other vertices, and tend to be similar to each other. This makes their representations less expressive. Learning graphs using paths, on the other hand, can be less adversely affected by this problem because it does not involve all vertex neighbors. However, most of them can only compare paths with the same length, which might engender information loss. To resolve this difficulty, we propose a new Graph Kernel based on a Longest Common Subsequence (LCS) similarity. Moreover, we found that the widely-used R-convolution framework is unsuitable for path-based Graph Kernel because a huge number of comparisons between dissimilar paths might deteriorate graph distances calculation. Therefore, we propose a novel metric space by exploiting the proposed LCS-based similarity, and compute a new Wasserstein-based graph distance in this metric space, which emphasizes more the comparison between similar paths. Furthermore, to reduce the computational cost, we propose an adjacent point merging operation to sparsify point clouds in the metric space.
Submitted 29 October, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
Wasserstein k-means with sparse simplex projection
Authors:
Takumi Fukunaga,
Hiroyuki Kasai
Abstract:
This paper presents a proposal of a faster Wasserstein $k$-means algorithm for histogram data by reducing Wasserstein distance computations and exploiting sparse simplex projection. We shrink data samples, centroids, and the ground cost matrix, which leads to considerable reduction of the computations used to solve optimal transport problems without loss of clustering quality. Furthermore, we dynamically reduce the computational complexity by removing lower-valued data samples and harnessing sparse simplex projection while keeping the degradation of clustering quality lower. We designate this proposed algorithm as sparse simplex projection based Wasserstein $k$-means, or SSPW $k$-means. Numerical evaluations conducted with comparison to results obtained using Wasserstein $k$-means algorithm demonstrate the effectiveness of the proposed SSPW $k$-means for real-world datasets.
Submitted 25 November, 2020;
originally announced November 2020.
-
Consistency-aware and Inconsistency-aware Graph-based Multi-view Clustering
Authors:
Mitsuhiko Horie,
Hiroyuki Kasai
Abstract:
Multi-view data analysis has gained increasing popularity because multi-view data are frequently encountered in machine learning applications. A simple but promising approach for clustering of multi-view data is multi-view clustering (MVC), which has been developed extensively to classify given subjects into some clustered groups by learning latent common features that are shared across multi-view data. Among existing approaches, graph-based multi-view clustering (GMVC) achieves state-of-the-art performance by leveraging a shared graph matrix called the unified matrix. However, existing methods including GMVC do not explicitly address inconsistent parts of input graph matrices. Consequently, they are adversely affected by unacceptable clustering performance. To this end, this paper proposes a new GMVC method that incorporates consistent and inconsistent parts lying across multiple views. This proposal is designated as CI-GMVC. Numerical evaluations of real-world datasets demonstrate the effectiveness of the proposed CI-GMVC.
Submitted 25 November, 2020;
originally announced November 2020.
-
Graph embedding using multi-layer adjacent point merging model
Authors:
Jianming Huang,
Hiroyuki Kasai
Abstract:
For graph classification tasks, many traditional kernel methods focus on measuring the similarity between graphs. These methods have achieved great success on resolving graph isomorphism problems. However, in some classification problems, the graph class depends on not only the topological similarity of the whole graph, but also constituent subgraph patterns. To this end, we propose a novel graph embedding method using a multi-layer adjacent point merging model. This embedding method allows us to extract different subgraph patterns from train-data. Then we present a flexible loss function for feature selection which enhances the robustness of our method for different classification problems. Finally, numerical evaluations demonstrate that our proposed method outperforms many state-of-the-art methods.
Submitted 17 February, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
-
Open-Ended Fine-Grained 3D Object Categorization by Combining Shape and Texture Features in Multiple Colorspaces
Authors:
Nils Keunecke,
S. Hamidreza Kasaei
Abstract:
As a consequence of an ever-increasing number of service robots, there is a growing demand for highly accurate real-time 3D object recognition. Considering the expansion of robot applications in more complex and dynamic environments, it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the functionality to learn about new object categories in an open-ended fashion while working in the environment. Towards this goal, we propose a deep transfer learning approach to generate a scale- and pose-invariant object representation by considering shape and texture information in multiple colorspaces. The obtained global object representation is then fed to an instance-based object category learning and recognition, where a non-expert human user exists in the learning loop and can interactively guide the process of experience acquisition by teaching new object categories, or by correcting insufficient or erroneous categories. In this work, shape information encodes the common patterns of all categories, while texture information is used to describe the appearance of each instance in detail. Multiple color space combinations and network architectures are evaluated to find the most descriptive system. Experimental results showed that the proposed network architecture outperformed the selected state-of-the-art approaches in terms of object classification accuracy and scalability. Furthermore, we performed a real robot experiment in the context of a serve-a-beer scenario to show the real-time performance of the proposed approach.
Submitted 28 May, 2021; v1 submitted 19 September, 2020;
originally announced September 2020.
-
3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks
Authors:
Sudhakaran Jain,
Hamidreza Kasaei
Abstract:
Service robots, in general, have to work independently and adapt to the dynamic changes happening in the environment in real-time. One important aspect in such scenarios is to continually learn to recognize newer object categories when they become available. This combines two main research problems namely continual learning and 3D object recognition. Most of the existing research approaches include the use of deep Convolutional Neural Networks (CNNs) focusing on image datasets. A modified approach might be needed for continually learning 3D object categories. A major concern in using CNNs is the problem of catastrophic forgetting when a model tries to learn a new task. Despite various proposed solutions to mitigate this problem, there still exist some downsides of such solutions, e.g., computational complexity, especially when learning substantial number of tasks. These downsides can pose major problems in robotic scenarios where real-time response plays an essential role. Towards addressing this challenge, we propose a new deep transfer learning approach based on a dynamic architectural method to make robots capable of open-ended learning about new 3D object categories. Furthermore, we make sure that the mentioned downsides are minimized to a great extent. Experimental results showed that the proposed model outperformed state-of-the-art approaches with regards to accuracy and also substantially minimizes computational overhead.
Submitted 15 March, 2021; v1 submitted 15 September, 2020;
originally announced September 2020.
-
Local-HDP: Interactive Open-Ended 3D Object Categorization in Real-Time Robotic Scenarios
Authors:
H. Ayoobi,
H. Kasaei,
M. Cao,
R. Verbrugge,
B. Verheij
Abstract:
We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment in time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features to high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics that is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, the online variational inference method has been adapted for fast posterior approximation in the Local-HDP model. Experiments show that the proposed Local-HDP method outperforms other state-of-the-art approaches in terms of accuracy, scalability, and memory efficiency by a large margin. Moreover, two robotic experiments have been conducted to show the applicability of the proposed approach in real-time applications.
Submitted 11 April, 2021; v1 submitted 2 September, 2020;
originally announced September 2020.
-
The State of Lifelong Learning in Service Robots: Current Bottlenecks in Object Perception and Manipulation
Authors:
S. Hamidreza Kasaei,
Jorik Melsen,
Floris van Beers,
Christiaan Steenkist,
Klemen Voncina
Abstract:
Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state-of-the-art continues to improve to make a proper coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to continually adapt to new environments and safely interact with non-expert human users. Nowadays, robots are able to recognize various objects, and quickly plan a collision-free trajectory to grasp a target object in predefined settings. Besides, in most of the cases, there is a reliance on large amounts of training data. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. Therefore, these approaches are still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In such environments, no matter how extensive the training data used for batch learning, a robot will always face new objects. Therefore, apart from batch learning, the robot should be able to continually learn about new object categories and grasp affordances from very few training examples on-site. Moreover, apart from robot self-learning, non-expert users could interactively guide the process of experience acquisition by teaching new concepts, or by correcting insufficient or erroneous concepts. In this way, the robot will constantly learn how to help humans in everyday tasks by gaining more and more experiences without the need for re-programming.
Submitted 6 May, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Tetragonality induced superconductivity in anti-ThCr$_2$Si$_2$-type $RE_2$O$_2$Bi ($RE$ = rare earth) with Bi square net
Authors:
Ryosuke Sei,
Hideyuki Kawasoko,
Kota Matsumoto,
Masato Arimitsu,
Kyohei Terakado,
Daichi Oka,
Shintaro Fukuda,
Noriaki Kimura,
Hidetaka Kasai,
Eiji Nishibori,
Kenji Ohoyama,
Akinori Hoshikawa,
Toru Ishigaki,
Tetsuya Hasegawa,
Tomoteru Fukumura
Abstract:
We report a series of layered superconductors, anti-ThCr$_2$Si$_2$-type $RE_2$O$_2$Bi ($RE$ = rare earth), composed of electrically conductive Bi square nets and magnetic insulating $RE_2$O$_2$ layers. The superconductivity was induced by separating Bi square nets as a result of excess oxygen incorporation, irrespective of the presence of magnetic ordering in $RE_2$O$_2$ layers. Intriguingly, the transition temperature of all $RE_2$O$_2$Bi including nonmagnetic Y$_2$O$_2$Bi was approximately scaled by the unit cell tetragonality ($c$/$a$), implying a key role of relative separation of the Bi square nets to induce the superconductivity.
Submitted 11 March, 2020;
originally announced March 2020.
-
Learning to Grasp 3D Objects using Deep Residual U-Nets
Authors:
Yikun Li,
Lambert Schomaker,
S. Hamidreza Kasaei
Abstract:
Grasp synthesis is one of the challenging tasks for any robot object manipulation task. In this paper, we present a new deep learning-based grasp synthesis approach for 3D objects. In particular, we propose an end-to-end 3D Convolutional Neural Network to predict the objects' graspable areas. We named our approach Res-U-Net since the architecture of the network is designed based on U-Net structure and residual network-styled blocks. It is devised to plan 6-DOF grasps for any desired object, be efficient to compute and use, and be robust against varying point cloud density and Gaussian noise. We have performed extensive experiments to assess the performance of the proposed approach concerning graspable part detection, grasp success rate, and robustness to varying point cloud density and Gaussian noise. Experiments validate the promising performance of the proposed architecture in all aspects. A video showing the performance of our approach in the simulation environment can be found at: http://youtu.be/5_yAJCc8owo
Submitted 12 September, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition
Authors:
S. Hamidreza Kasaei,
Maryam Ghorbani,
Jits Schilperoort,
Wessel van der Rest
Abstract:
Despite the recent success of state-of-the-art 3D object recognition approaches, service robots frequently fail to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time response under changing and unpredictable environmental conditions. Most of the recent approaches use either the shape information only and ignore the role of color information or vice versa. Furthermore, they mainly utilize the $L_n$ Minkowski family functions to measure the similarity of two object views, while there are various distance measures that are applicable to compare two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition. Towards this goal, we extensively evaluate the performance of object recognition approaches in three different configurations, including color-only, shape-only, and combinations of color and shape, in both offline and online settings. Experimental results concerning scalability, memory usage, and object recognition performance show that all of the combinations of color and shape yield significant improvements over the shape-only and color-only approaches. The underlying reason is that color information is an important feature to distinguish objects that have very similar geometric properties with different colors and vice versa. Moreover, by combining color and shape information, we demonstrate that the robot can learn new object categories from very few training examples in a real-world setting.
Submitted 26 September, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
Accelerating Reinforcement Learning for Reaching using Continuous Curriculum Learning
Authors:
Sha Luo,
Hamidreza Kasaei,
Lambert Schomaker
Abstract:
Reinforcement learning has shown great promise in the training of robot behavior due to the sequential decision making characteristics. However, the required enormous amount of interactive and informative training data provides the major stumbling block for progress. In this study, we focus on accelerating reinforcement learning (RL) training and improving the performance of multi-goal reaching tasks. Specifically, we propose a precision-based continuous curriculum learning (PCCL) method in which the requirements are gradually adjusted during the training process, instead of fixing the parameter in a static schedule. To this end, we explore various continuous curriculum strategies for controlling a training process. This approach is tested using a Universal Robot 5e in both simulation and real-world multi-goal reach experiments. Experimental results support the hypothesis that a static training schedule is suboptimal, and using an appropriate decay function for curriculum learning provides superior results in a faster way.
Submitted 21 December, 2020; v1 submitted 7 February, 2020;
originally announced February 2020.
-
Interactive Open-Ended Learning for 3D Object Recognition
Authors:
S. Hamidreza Kasaei
Abstract:
The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates adaptation to new environments. Inspired by this capability, we seek to create a cognitive object perception and perceptual learning architecture that can learn 3D object categories in an open-ended fashion. In this context, ``open-ended'' implies that the set of categories to be learned is not known in advance, and that the training instances are extracted from the actual experiences of a robot and thus become available gradually, rather than being available from the beginning of the learning process. In particular, this architecture provides perception capabilities that allow robots to incrementally learn object categories from the set of accumulated experiences and to reason about how to perform complex tasks. The framework integrates detection, tracking, teaching, learning, and recognition of objects. An extensive set of systematic experiments, in multiple experimental settings, was carried out to thoroughly evaluate the described learning approaches. Experimental results show that the proposed system is able to interact with human users, learn new object categories over time, and perform complex tasks. The contributions presented in this thesis have been fully implemented, evaluated on different standard object and scene datasets, and empirically evaluated on different robotic platforms.
Submitted 19 December, 2019;
originally announced December 2019.