Search | arXiv e-print repository

arXiv:2406.18746 [pdf, other]

Lifelong Robot Library Learning: Bootstrap** Composable and Generalizable Skills for Embodied Control with Language Models

Authors: Georgios Tziafas, Hamidreza Kasaei

Abstract: Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior arts fix their skills library and steer the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work,… ▽ More Large Language Models (LLMs) have emerged as a new paradigm for embodied reasoning and control, most recently by generating robot policy code that utilizes a custom library of vision and control primitive skills. However, prior arts fix their skills library and steer the LLM with carefully hand-crafted prompt engineering, limiting the agent to a stationary range of addressable tasks. In this work, we introduce LRLL, an LLM-based lifelong learning agent that continuously grows the robot skill library to tackle manipulation tasks of ever-growing complexity. LRLL achieves this with four novel contributions: 1) a soft memory module that allows dynamic storage and retrieval of past experiences to serve as context, 2) a self-guided exploration policy that proposes new tasks in simulation, 3) a skill abstractor that distills recent experiences into new library skills, and 4) a lifelong learning algorithm for enabling human users to bootstrap new skills with minimal online interaction. LRLL continuously transfers knowledge from the memory to the library, building composable, general and interpretable policies, while bypassing gradient-based optimization, thus relieving the learner from catastrophic forgetting. Empirical evaluation in a simulated tabletop environment shows that LRLL outperforms end-to-end and vanilla LLM approaches in the lifelong setup while learning skills that are transferable to the real world. Project material will become available at the webpage https://gtziafas.github.io/LRLL_project. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Published ICRA-24

arXiv:2406.18742 [pdf, other]

3D Feature Distillation with Object-Centric Priors

Authors: Georgios Tziafas, Yucheng Xu, Zhibin Li, Hamidreza Kasaei

Abstract: Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural f… ▽ More Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic gras** in clutter △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Submitted CoRL-24

arXiv:2406.18722 [pdf, other]

Towards Open-World Gras** with Large Vision-Language Models

Authors: Georgios Tziafas, Hamidreza Kasaei

Abstract: The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world gras** system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and re… ▽ More The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world gras** system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for gras** in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world gras** pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic gras** experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Submitted CoRL24

arXiv:2402.16045 [pdf, other]

Harnessing the Synergy between Pushing, Gras**, and Throwing to Enhance Object Manipulation in Cluttered Scenarios

Authors: Hamidreza Kasaei, Mohammadreza Kasaei

Abstract: In this work, we delve into the intricate synergy among non-prehensile actions like pushing, and prehensile actions such as gras** and throwing, within the domain of robotic manipulation. We introduce an innovative approach to learning these synergies by leveraging model-free deep reinforcement learning. The robot's workflow involves detecting the pose of the target object and the basket at each… ▽ More In this work, we delve into the intricate synergy among non-prehensile actions like pushing, and prehensile actions such as gras** and throwing, within the domain of robotic manipulation. We introduce an innovative approach to learning these synergies by leveraging model-free deep reinforcement learning. The robot's workflow involves detecting the pose of the target object and the basket at each time step, predicting the optimal push configuration to isolate the target object, determining the appropriate grasp configuration, and inferring the necessary parameters for an accurate throw into the basket. This empowers robots to skillfully reconfigure cluttered scenarios through pushing, creating space for collision-free gras** actions. Simultaneously, we integrate throwing behavior, showcasing how this action significantly extends the robot's operational reach. Ensuring safety, we developed a simulation environment in Gazebo for robot training, applying the learned policy directly to our real robot. Notably, this work represents a pioneering effort to learn the synergy between pushing, gras**, and throwing actions. Extensive experimentation in both simulated and real-robot scenarios substantiates the effectiveness of our approach across diverse settings. Our approach achieves a success rate exceeding 80\% in both simulated and real-world scenarios. A video showcasing our experiments is available online at: https://youtu.be/q1l4BJVDbRw △ Less

Submitted 25 February, 2024; originally announced February 2024.

Comments: This paper has been accepted at the 2024 IEEE International Conference on Robotics and Automation (ICRA 2024)

arXiv:2311.05779 [pdf, other]

Language-guided Robot Gras**: CLIP-based Referring Grasp Synthesis in Clutter

Authors: Georgios Tziafas, Yucheng Xu, Arushi Goel, Mohammadreza Kasaei, Zhibin Li, Hamidreza Kasaei

Abstract: Robots operating in human-centric environments require the integration of visual grounding and gras** capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that firs… ▽ More Robots operating in human-centric environments require the integration of visual grounding and gras** capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and gras**. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object gras** scenarios that include clutter. △ Less

Submitted 9 November, 2023; originally announced November 2023.

Comments: Poster CoRL 2023. Dataset and code available here: https://github.com/gtziafas/OCID-VLG

arXiv:2310.16123 [pdf, other]

Anchor Space Optimal Transport: Accelerating Batch Processing of Multiple OT Problems

Authors: Jianming Huang, Xun Su, Zhongxi Fang, Hiroyuki Kasai

Abstract: The optimal transport (OT) theory provides an effective way to compare probability distributions on a defined metric space, but it suffers from cubic computational complexity. Although the Sinkhorn's algorithm greatly reduces the computational complexity of OT solutions, the solutions of multiple OT problems are still time-consuming and memory-comsuming in practice. However, many works on the comp… ▽ More The optimal transport (OT) theory provides an effective way to compare probability distributions on a defined metric space, but it suffers from cubic computational complexity. Although the Sinkhorn's algorithm greatly reduces the computational complexity of OT solutions, the solutions of multiple OT problems are still time-consuming and memory-comsuming in practice. However, many works on the computational acceleration of OT are usually based on the premise of a single OT problem, ignoring the potential common characteristics of the distributions in a mini-batch. Therefore, we propose a translated OT problem designated as the anchor space optimal transport (ASOT) problem, which is specially designed for batch processing of multiple OT problem solutions. For the proposed ASOT problem, the distributions will be mapped into a shared anchor point space, which learns the potential common characteristics and thus help accelerate OT batch processing. Based on the proposed ASOT, the Wasserstein distance error to the original OT problem is proven to be bounded by ground cost errors. Building upon this, we propose three methods to learn an anchor space minimizing the distance error, each of which has its application background. Numerical experiments on real-world datasets show that our proposed methods can greatly reduce computational time while maintaining reasonable approximation performance. △ Less

Submitted 24 October, 2023; originally announced October 2023.

Comments: 26 pages, 4 figures, 6 tables

arXiv:2310.07937 [pdf, other]

Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation using Large Language Models

Authors: Bangguo Yu, Hamidreza Kasaei, Ming Cao

Abstract: In advanced human-robot interaction tasks, visual target navigation is crucial for autonomous robots navigating unknown environments. While numerous approaches have been developed in the past, most are designed for single-robot operations, which often suffer from reduced efficiency and robustness due to environmental complexities. Furthermore, learning policies for multi-robot collaboration are re… ▽ More In advanced human-robot interaction tasks, visual target navigation is crucial for autonomous robots navigating unknown environments. While numerous approaches have been developed in the past, most are designed for single-robot operations, which often suffer from reduced efficiency and robustness due to environmental complexities. Furthermore, learning policies for multi-robot collaboration are resource-intensive. To address these challenges, we propose Co-NavGPT, an innovative framework that integrates Large Language Models (LLMs) as a global planner for multi-robot cooperative visual target navigation. Co-NavGPT encodes the explored environment data into prompts, enhancing LLMs' scene comprehension. It then assigns exploration frontiers to each robot for efficient target search. Experimental results on Habitat-Matterport 3D (HM3D) demonstrate that Co-NavGPT surpasses existing models in success rates and efficiency without any learning process, demonstrating the vast potential of LLMs in multi-robot collaboration domains. The supplementary video, prompts, and code can be accessed via the following link: https://sites.google.com/view/co-navgpt △ Less

Submitted 25 December, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 7 pages, 4 figures, conference

arXiv:2307.00247 [pdf, other]

Safe Screening for Unbalanced Optimal Transport

Authors: Xun Su, Zhongxi Fang, Hiroyuki Kasai

Abstract: This paper introduces a framework that utilizes the Safe Screening technique to accelerate the optimization process of the Unbalanced Optimal Transport (UOT) problem by proactively identifying and eliminating zero elements in the sparse solutions. We demonstrate the feasibility of applying Safe Screening to the UOT problem with $\ell_2$-penalty and KL-penalty by conducting an analysis of the solut… ▽ More This paper introduces a framework that utilizes the Safe Screening technique to accelerate the optimization process of the Unbalanced Optimal Transport (UOT) problem by proactively identifying and eliminating zero elements in the sparse solutions. We demonstrate the feasibility of applying Safe Screening to the UOT problem with $\ell_2$-penalty and KL-penalty by conducting an analysis of the solution's bounds and considering the local strong convexity of the dual problem. Considering the specific structural characteristics of the UOT in comparison to general Lasso problems on the index matrix, we specifically propose a novel approximate projection, an elliptical safe region construction, and a two-hyperplane relaxation method. These enhancements significantly improve the screening efficiency for the UOT's without altering the algorithm's complexity. △ Less

Submitted 1 July, 2023; originally announced July 2023.

arXiv:2306.15919 [pdf, other]

Fine-grained 3D object recognition: an approach and experiments

Authors: Junhyung Jo, Hamidreza Kasaei

Abstract: Three-dimensional (3D) object recognition technology is being used as a core technology in advanced technologies such as autonomous driving of automobiles. There are two sets of approaches for 3D object recognition: (i) hand-crafted approaches like Global Orthographic Object Descriptor (GOOD), and (ii) deep learning-based approaches such as MobileNet and VGG. However, it is needed to know which of… ▽ More Three-dimensional (3D) object recognition technology is being used as a core technology in advanced technologies such as autonomous driving of automobiles. There are two sets of approaches for 3D object recognition: (i) hand-crafted approaches like Global Orthographic Object Descriptor (GOOD), and (ii) deep learning-based approaches such as MobileNet and VGG. However, it is needed to know which of these approaches works better in an open-ended domain where the number of known categories increases over time, and the system should learn about new object categories using few training examples. In this paper, we first implemented an offline 3D object recognition system that takes an object view as input and generates category labels as output. In the offline stage, instance-based learning (IBL) is used to form a new category and we use K-fold cross-validation to evaluate the obtained object recognition performance. We then test the proposed approach in an online fashion by integrating the code into a simulated teacher test. As a result, we concluded that the approach using deep learning features is more suitable for open-ended fashion. Moreover, we observed that concatenating the hand-crafted and deep learning features increases the classification accuracy. △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2304.07236 [pdf, other]

Learning Perceptive Bipedal Locomotion over Irregular Terrain

Authors: Bart van Marum, Matthia Sabatelli, Hamidreza Kasaei

Abstract: In this paper we propose a novel bipedal locomotion controller that uses noisy exteroception to traverse a wide variety of terrains. Building on the cutting-edge advancements in attention based belief encoding for quadrupedal locomotion, our work extends these methods to the bipedal domain, resulting in a robust and reliable internal belief of the terrain ahead despite noisy sensor inputs. Additio… ▽ More In this paper we propose a novel bipedal locomotion controller that uses noisy exteroception to traverse a wide variety of terrains. Building on the cutting-edge advancements in attention based belief encoding for quadrupedal locomotion, our work extends these methods to the bipedal domain, resulting in a robust and reliable internal belief of the terrain ahead despite noisy sensor inputs. Additionally, we present a reward function that allows the controller to successfully traverse irregular terrain. We compare our method with a proprioceptive baseline and show that our method is able to traverse a wide variety of terrains and greatly outperforms the state-of-the-art in terms of robustness, speed and efficiency. △ Less

Submitted 14 April, 2023; originally announced April 2023.

Comments: 8 pages, 10 figures

arXiv:2304.05506 [pdf, ps, other]

doi 10.1109/ICRA48891.2023.10161059

Frontier Semantic Exploration for Visual Target Navigation

Authors: Bangguo Yu, Hamidreza Kasaei, Ming Cao

Abstract: This work focuses on the problem of visual target navigation, which is very important for autonomous robots as it is closely related to high-level tasks. To find a special object in unknown environments, classical and learning-based approaches are fundamental components of navigation that have been investigated thoroughly in the past. However, due to the difficulty in the representation of complic… ▽ More This work focuses on the problem of visual target navigation, which is very important for autonomous robots as it is closely related to high-level tasks. To find a special object in unknown environments, classical and learning-based approaches are fundamental components of navigation that have been investigated thoroughly in the past. However, due to the difficulty in the representation of complicated scenes and the learning of the navigation policy, previous methods are still not adequate, especially for large unknown scenes. Hence, we propose a novel framework for visual target navigation using the frontier semantic policy. In this proposed framework, the semantic map and the frontier map are built from the current observation of the environment. Using the features of the maps and object category, deep reinforcement learning enables to learn a frontier semantic policy which can be used to select a frontier cell as a long-term goal to explore the environment efficiently. Experiments on Gibson and Habitat-Matterport 3D (HM3D) demonstrate that the proposed framework significantly outperforms existing map-based methods in terms of success rate and efficiency. Ablation analysis also indicates that the proposed approach learns a more efficient exploration policy based on the frontiers. A demonstration is provided to verify the applicability of applying our model to real-world transfer. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/fsevn. △ Less

Submitted 25 December, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 7 pages

Journal ref: 2023 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2304.05501 [pdf, ps, other]

doi 10.1109/IROS55552.2023.10342512

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Authors: Bangguo Yu, Hamidreza Kasaei, Ming Cao

Abstract: Visual target navigation in unknown environments is a crucial problem in robotics. Despite extensive investigation of classical and learning-based approaches in the past, robots lack common-sense knowledge about household objects and layouts. Prior state-of-the-art approaches to this task rely on learning the priors during the training and typically require significant expensive resources and time… ▽ More Visual target navigation in unknown environments is a crucial problem in robotics. Despite extensive investigation of classical and learning-based approaches in the past, robots lack common-sense knowledge about household objects and layouts. Prior state-of-the-art approaches to this task rely on learning the priors during the training and typically require significant expensive resources and time for learning. To address this, we propose a new framework for visual target navigation that leverages Large Language Models (LLM) to impart common sense for object searching. Specifically, we introduce two paradigms: (i) zero-shot and (ii) feed-forward approaches that use language to find the relevant frontier from the semantic map as a long-term goal and explore the environment efficiently. Our analysis demonstrates the notable zero-shot generalization and transfer capabilities from the use of language. Experiments on Gibson and Habitat-Matterport 3D (HM3D) demonstrate that the proposed framework significantly outperforms existing map-based methods in terms of success rate and generalization. Ablation analysis also indicates that the common-sense knowledge from the language model leads to more efficient semantic exploration. Finally, we provide a real robot experiment to verify the applicability of our framework in real-world scenarios. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/l3mvn. △ Less

Submitted 25 December, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 7 pages

Journal ref: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2303.05323 [pdf, other]

Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODE

Authors: Yucheng Xu, Li Nanbo, Arushi Goel, Zijian Guo, Zonghai Yao, Hamidreza Kasaei, Mohammadreze Kasaei, Zhibin Li

Abstract: Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework le… ▽ More Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations~(Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards develo** advanced controllable video generation models that can handle complex and dynamic scenes. △ Less

Submitted 4 April, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

arXiv:2302.07824 [pdf, other]

Instance-wise Grasp Synthesis for Robotic Gras**

Authors: Yucheng Xu, Mohammadreza Kasaei, Hamidreza Kasaei, Zhibin Li

Abstract: Generating high-quality instance-wise grasp configurations provides critical information of how to grasp specific objects in a multi-object environment and is of high importance for robot manipulation tasks. This work proposed a novel \textbf{S}ingle-\textbf{S}tage \textbf{G}rasp (SSG) synthesis network, which performs high-quality instance-wise grasp synthesis in a single stage: instance mask and… ▽ More Generating high-quality instance-wise grasp configurations provides critical information of how to grasp specific objects in a multi-object environment and is of high importance for robot manipulation tasks. This work proposed a novel \textbf{S}ingle-\textbf{S}tage \textbf{G}rasp (SSG) synthesis network, which performs high-quality instance-wise grasp synthesis in a single stage: instance mask and grasp configurations are generated for each object simultaneously. Our method outperforms state-of-the-art on robotic grasp prediction based on the OCID-Grasp dataset, and performs competitively on the JACQUARD dataset. The benchmarking results showed significant improvements compared to the baseline on the accuracy of generated grasp configurations. The performance of the proposed method has been validated through both extensive simulations and real robot experiments for three tasks including single object pick-and-place, grasp synthesis in cluttered environments and table cleaning task. △ Less

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2301.07037 [pdf, other]

Explain What You See: Open-Ended Segmentation and Recognition of Occluded 3D Objects

Authors: H. Ayoobi, H. Kasaei, M. Cao, R. Verbrugge, B. Verheij

Abstract: Local-HDP (for Local Hierarchical Dirichlet Process) is a hierarchical Bayesian method that has recently been used for open-ended 3D object category recognition. This method has been proven to be efficient in real-time robotic applications. However, the method is not robust to a high degree of occlusion. We address this limitation in two steps. First, we propose a novel semantic 3D object-parts se… ▽ More Local-HDP (for Local Hierarchical Dirichlet Process) is a hierarchical Bayesian method that has recently been used for open-ended 3D object category recognition. This method has been proven to be efficient in real-time robotic applications. However, the method is not robust to a high degree of occlusion. We address this limitation in two steps. First, we propose a novel semantic 3D object-parts segmentation method that has the flexibility of Local-HDP. This method is shown to be suitable for open-ended scenarios where the number of 3D objects or object parts is not fixed and can grow over time. We show that the proposed method has a higher percentage of mean intersection over union, using a smaller number of learning instances. Second, we integrate this technique with a recently introduced argumentation-based online incremental learning method, thereby enabling the model to handle a high degree of occlusion. We show that the resulting model produces an explicit set of explanations for the 3D object category recognition task. △ Less

Submitted 17 January, 2023; originally announced January 2023.

Comments: Accepted at ICRA 2023 Conference

arXiv:2210.04613 [pdf, other]

Enhancing Fine-Grained 3D Object Recognition using Hybrid Multi-Modal Vision Transformer-CNN Models

Authors: Songsong Xiong, Georgios Tziafas, Hamidreza Kasaei

Abstract: Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-gra… ▽ More Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to the high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant problem in addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Networks (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets. The first dataset consists of 20 categories related to restaurants with a total of 100 instances, while the second dataset contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving a recognition accuracy of 94.50 % and 93.51 % on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we successfully integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios. △ Less

Submitted 6 March, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2210.03628 [pdf, other]

GraspCaps: A Capsule Network Approach for Familiar 6DoF Object Gras**

Authors: Tomas van der Velde, Hamed Ayoobi, Hamidreza Kasaei

Abstract: As robots become more widely available outside industrial settings, the need for reliable object gras** and manipulation is increasing. In such environments, robots must be able to grasp and manipulate novel objects in various situations. This paper presents GraspCaps, a novel architecture based on Capsule Networks for generating per-point 6D grasp configurations for familiar objects. GraspCaps… ▽ More As robots become more widely available outside industrial settings, the need for reliable object gras** and manipulation is increasing. In such environments, robots must be able to grasp and manipulate novel objects in various situations. This paper presents GraspCaps, a novel architecture based on Capsule Networks for generating per-point 6D grasp configurations for familiar objects. GraspCaps extracts a rich feature vector of the objects present in the point cloud input, which is then used to generate per-point grasp vectors. This approach allows the network to learn specific gras** strategies for each object category. In addition to GraspCaps, the paper also presents a method for generating a large object-gras** dataset using simulated annealing. The obtained dataset is then used to train the GraspCaps network. Through extensive experiments, we evaluate the performance of the proposed approach, particularly in terms of the success rate of gras** familiar objects in challenging real and simulated scenarios. The experimental results showed that the overall object-gras** performance of the proposed approach is significantly better than the selected baseline. This superior performance highlights the effectiveness of the GraspCaps in achieving successful object gras** across various scenarios. △ Less

Submitted 29 November, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

Comments: Submitted to CVPR 2023, Supplementary video: https://youtu.be/d13rEhKgApI?si=EhgbDI84nlXL5V2M

arXiv:2210.00858 [pdf, other]

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Authors: Georgios Tziafas, Hamidreza Kasaei

Abstract: In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of… ▽ More In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), therefore marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available in https://gtziafas.github.io/neurosymbolic-manipulation. △ Less

Submitted 7 May, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Submitted T-RO

arXiv:2210.00843 [pdf, other]

Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision Transformers for 3D Object Recognition

Authors: Georgios Tziafas, Hamidreza Kasaei

Abstract: The Vision Transformer (ViT) architecture has established its place in computer vision literature, however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to a… ▽ More The Vision Transformer (ViT) architecture has established its place in computer vision literature, however, training ViTs for RGB-D object recognition remains an understudied topic, viewed in recent literature only through the lens of multi-task pretraining in multiple vision modalities. Such approaches are often computationally intensive, relying on the scale of multiple pretraining datasets to align RGB with 3D information. In this work, we propose a simple yet strong recipe for transferring pretrained ViTs in RGB-D domains for 3D object recognition, focusing on fusing RGB and depth representations encoded jointly by the ViT. Compared to previous works in multimodal Transformers, the key challenge here is to use the attested flexibility of ViTs to capture cross-modal interactions at the downstream and not the pretraining stage. We explore which depth representation is better in terms of resulting accuracy and compare early and late fusion techniques for aligning the RGB and depth modalities within the ViT architecture. Experimental results in the Washington RGB-D Objects dataset (ROD) demonstrate that in such RGB -> RGB-D scenarios, late fusion techniques work better than most popularly employed early fusion. With our transfer baseline, fusion ViTs score up to 95.4% top-1 accuracy in ROD, achieving new state-of-the-art results in this benchmark. We further show the benefits of using our multimodal fusion baseline over unimodal feature extractors in a synthetic-to-real visual adaptation as well as in an open-ended lifelong learning scenario in the ROD benchmark, where our model outperforms previous works by a margin of >8%. Finally, we integrate our method with a robot framework and demonstrate how it can serve as a perception utility in an interactive robot learning scenario, both in simulation and with a real robot. △ Less

Submitted 7 March, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Submitted IROS 23. Supplementary video here: https://youtu.be/L2gkDPkHsfo

arXiv:2210.00803 [pdf, other]

IPPO: Obstacle Avoidance for Robotic Manipulators in Joint Space via Improved Proximal Policy Optimization

Authors: Yongliang Wang, Hamidreza Kasaei

Abstract: Reaching tasks with random targets and obstacles is a challenging task for robotic manipulators. In this study, we propose a novel model-free reinforcement learning approach based on proximal policy optimization (PPO) for training a deep policy to map the task space to the joint space of a 6-DoF manipulator. To facilitate the training process in a large workspace, we develop an efficient represent… ▽ More Reaching tasks with random targets and obstacles is a challenging task for robotic manipulators. In this study, we propose a novel model-free reinforcement learning approach based on proximal policy optimization (PPO) for training a deep policy to map the task space to the joint space of a 6-DoF manipulator. To facilitate the training process in a large workspace, we develop an efficient representation of environmental inputs and outputs. The calculation of the distance between obstacles and manipulator links is incorporated into the state representation using a geometry-based method. Additionally, to enhance the performance of the model in reaching tasks, we introduce the action ensembles method and design the policy to directly participate in value function updates in PPO. To overcome the challenges associated with training in real-robot environments, we develop a simulation environment in Gazebo to train the model as it produces a smaller Sim-to-Real gap compared to other simulators. However, training in Gazebo is time-intensive. To address this issue, we propose a Sim-to-Sim method to significantly reduce the training time. The trained model is then directly applied in a real-robot setup without fine-tuning. To evaluate the performance of the proposed approach, we perform several rounds of experiments in both simulated and real robots. We also compare the performance of the proposed approach with six baselines. The experimental results demonstrate the effectiveness of the proposed method in performing reaching tasks with and without obstacles. our method outperformed the selected baselines by a large margin in different reaching task scenarios. A video of these experiments has been attached to the paper as supplementary material. △ Less

Submitted 9 February, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2210.00609 [pdf, other]

Throwing Objects into A Moving Basket While Avoiding Obstacles

Authors: Hamidreza Kasaei, Mohammadreza Kasaei

Abstract: The capabilities of a robot will be increased significantly by exploiting throwing behavior. In particular, throwing will enable robots to rapidly place the object into the target basket, located outside its feasible kinematic space, without traveling to the desired location. In previous approaches, the robot often learned a parameterized throwing kernel through analytical approaches, imitation le… ▽ More The capabilities of a robot will be increased significantly by exploiting throwing behavior. In particular, throwing will enable robots to rapidly place the object into the target basket, located outside its feasible kinematic space, without traveling to the desired location. In previous approaches, the robot often learned a parameterized throwing kernel through analytical approaches, imitation learning, or hand-coding. There are many situations in which such approaches do not work/generalize well due to various object shapes, heterogeneous mass distribution, and also obstacles that might be presented in the environment. It is obvious that a method is needed to modulate the throwing kernel through its meta parameters. In this paper, we tackle object throwing problem through a deep reinforcement learning approach that enables robots to precisely throw objects into moving baskets while there are obstacles obstructing the path. To the best of our knowledge, we are the first group that addresses throwing objects with obstacle avoidance. Such a throwing skill not only increases the physical reachability of a robot arm but also improves the execution time. In particular, the robot detects the pose of the target object, basket, and obstacle at each time step, predicts the proper grasp configuration for the target object, and then infers appropriate parameters to throw the object into the basket. Due to safety constraints, we develop a simulation environment in Gazebo to train the robot and then use the learned policy in real-robot directly. To assess the performers of the proposed approach, we perform extensive sets of experiments in both simulation and real robots in three scenarios. Experimental results showed that the robot could precisely throw a target object into the basket outside its kinematic range and generalize well to new locations and objects without colliding with obstacles. △ Less

Submitted 2 October, 2022; originally announced October 2022.

Comments: The video of our experiments can be found at https://youtu.be/VmIFF__c_84

arXiv:2207.04216 [pdf, other]

Wasserstein Graph Distance Based on $L_1$-Approximated Tree Edit Distance between Weisfeiler-Lehman Subtrees

Authors: Zhongxi Fang, Jianming Huang, Xun Su, Hiroyuki Kasai

Abstract: The Weisfeiler-Lehman (WL) test is a widely used algorithm in graph machine learning, including graph kernels, graph metrics, and graph neural networks. However, it focuses only on the consistency of the graph, which means that it is unable to detect slight structural differences. Consequently, this limits its ability to capture structural information, which also limits the performance of existing… ▽ More The Weisfeiler-Lehman (WL) test is a widely used algorithm in graph machine learning, including graph kernels, graph metrics, and graph neural networks. However, it focuses only on the consistency of the graph, which means that it is unable to detect slight structural differences. Consequently, this limits its ability to capture structural information, which also limits the performance of existing models that rely on the WL test. This limitation is particularly severe for traditional metrics defined by the WL test, which cannot precisely capture slight structural differences. In this paper, we propose a novel graph metric called the Wasserstein WL Subtree (WWLS) distance to address this problem. Our approach leverages the WL subtree as structural information for node neighborhoods and defines node metrics using the $L_1$-approximated tree edit distance ($L_1$-TED) between WL subtrees of nodes. Subsequently, we combine the Wasserstein distance and the $L_1$-TED to define the WWLS distance, which can capture slight structural differences that may be difficult to detect using conventional metrics. We demonstrate that the proposed WWLS distance outperforms baselines in both metric validation and graph classification experiments. △ Less

Submitted 1 May, 2023; v1 submitted 9 July, 2022; originally announced July 2022.

arXiv:2205.13846 [pdf, ps, other]

On the Convergence of Semi-Relaxed Sinkhorn with Marginal Constraint and OT Distance Gaps

Authors: Takumi Fukunaga, Hiroyuki Kasai

Abstract: This paper presents consideration of the Semi-Relaxed Sinkhorn (SR-Sinkhorn) algorithm for the semi-relaxed optimal transport (SROT) problem, which relaxes one marginal constraint of the standard OT problem. For evaluation of how the constraint relaxation affects the algorithm behavior and solution, it is vitally necessary to present the theoretical convergence analysis in terms not only of the fu… ▽ More This paper presents consideration of the Semi-Relaxed Sinkhorn (SR-Sinkhorn) algorithm for the semi-relaxed optimal transport (SROT) problem, which relaxes one marginal constraint of the standard OT problem. For evaluation of how the constraint relaxation affects the algorithm behavior and solution, it is vitally necessary to present the theoretical convergence analysis in terms not only of the functional value gap, but also of the marginal constraint gap as well as the OT distance gap. However, no existing work has addressed all analyses simultaneously. To this end, this paper presents a comprehensive convergence analysis for SR-Sinkhorn. After presenting the $ε$-approximation of the functional value gap based on a new proof strategy and exploiting this proof strategy, we give the upper bound of the marginal constraint gap. We also provide its convergence to the $ε$-approximation when two distributions are in the probability simplex. Furthermore, the convergence analysis of the OT distance gap to the $ε$-approximation is given as assisted by the obtained marginal constraint gap. The latter two theoretical results are the first results presented in the literature related to the SROT problem. △ Less

Submitted 27 May, 2022; originally announced May 2022.

arXiv:2205.13766 [pdf, ps, other]

doi 10.1109/ICASSP43922.2022.9746032

Block-coordinate Frank-Wolfe algorithm and convergence analysis for semi-relaxed optimal transport problem

Authors: Takumi Fukunaga, Hiroyuki Kasai

Abstract: The optimal transport (OT) problem has been used widely for machine learning. It is necessary for computation of an OT problem to solve linear programming with tight mass-conservation constraints. These constraints prevent its application to large-scale problems. To address this issue, loosening such constraints enables us to propose the relaxed-OT method using a faster algorithm. This approach ha… ▽ More The optimal transport (OT) problem has been used widely for machine learning. It is necessary for computation of an OT problem to solve linear programming with tight mass-conservation constraints. These constraints prevent its application to large-scale problems. To address this issue, loosening such constraints enables us to propose the relaxed-OT method using a faster algorithm. This approach has demonstrated its effectiveness for applications. However, it remains slow. As a superior alternative, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm for a convex semi-relaxed OT. Specifically, we prove their upper bounds of the worst convergence iterations, and equivalence between the linearization duality gap and the Lagrangian duality gap. Additionally, we develop two fast variants of the proposed BCFW. Numerical experiments have demonstrated that our proposed algorithms are effective for color transfer and surpass state-of-the-art algorithms. This report presents a short version of arXiv:2103.05857. △ Less

Submitted 27 May, 2022; originally announced May 2022.

Comments: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2022). arXiv admin note: substantial text overlap with arXiv:2103.05857

arXiv:2205.12089 [pdf, other]

Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Authors: Georgios Tziafas, Hamidreza Kasaei

Abstract: Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grou… ▽ More Service robots should be able to interact naturally with non-expert human users, not only to help them in various tasks but also to receive guidance in order to resolve ambiguities that might be present in the instruction. We consider the task of visual grounding, where the agent segments an object from a crowded scene given a natural language description. Modern holistic approaches to visual grounding usually ignore language structure and struggle to cover generic domains, therefore relying heavily on large datasets. Additionally, their transfer performance in RGB-D datasets suffers due to high visual discrepancy between the benchmark and the target domains. Modular approaches marry learning with domain modeling and exploit the compositional nature of language to decouple visual representation from language parsing, but either rely on external parsers or are trained in an end-to-end fashion due to the lack of strong supervision. In this work, we seek to tackle these limitations by introducing a fully decoupled modular framework for compositional visual grounding of entities, attributes, and spatial relations. We exploit rich scene graph annotations generated in a synthetic domain and train each module independently. Our approach is evaluated both in simulation and in two real RGB-D scene datasets. Experimental results show that the decoupled nature of our framework allows for easy integration with domain adaptation approaches for Sim-To-Real visual recognition, offering a data-efficient, robust, and interpretable solution to visual grounding in robotic applications. △ Less

Submitted 10 July, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: Accepted CoLLAs 2022

arXiv:2205.01982 [pdf, other]

Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition

Authors: Hamidreza Kasaei, Songsong Xiong

Abstract: Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working in the environment and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multi… ▽ More Service robots are integrating more and more into our daily lives to help us with various tasks. In such environments, robots frequently face new objects while working in the environment and need to learn them in an open-ended fashion. Furthermore, such robots must be able to recognize a wide range of object categories. In this paper, we present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem. In particular, we form ensemble methods based on deep representations and handcrafted 3D shape descriptors. To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly. The proposed model is suitable for open-ended learning scenarios where the number of 3D object categories is not fixed and can grow over time. We have performed extensive sets of experiments to assess the performance of the proposed approach in offline, and open-ended scenarios. For the evaluation purpose, in addition to real object datasets, we generate a large synthetic household objects dataset consisting of 27000 views of 90 objects. Experimental results demonstrate the effectiveness of the proposed method on online few-shot 3D object recognition tasks, as well as its superior performance over the state-of-the-art open-ended learning approaches. Furthermore, our results show that while ensemble learning is modestly beneficial in offline settings, it is significantly beneficial in lifelong few-shot learning situations. Additionally, we demonstrated the effectiveness of our approach in both simulated and real-robot settings, where the robot rapidly learned new categories from limited examples. △ Less

Submitted 9 January, 2024; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: The paper has been accepted for publication in the Robotics and Autonomous Systems journal

arXiv:2203.02511 [pdf, other]

Self-Supervised Learning for Joint Pushing and Gras** Policies in Highly Cluttered Environments

Authors: Yongliang Wang, Kamal Mokhtar, Cock Heemskerk, Hamidreza Kasaei

Abstract: Robots often face situations where gras** a goal object is desirable but not feasible due to other present objects preventing the grasp action. We present a deep Reinforcement Learning approach to learn gras** and pushing policies for manipulating a goal object in highly cluttered environments to address this problem. In particular, a dual Reinforcement Learning model approach is proposed, whi… ▽ More Robots often face situations where gras** a goal object is desirable but not feasible due to other present objects preventing the grasp action. We present a deep Reinforcement Learning approach to learn gras** and pushing policies for manipulating a goal object in highly cluttered environments to address this problem. In particular, a dual Reinforcement Learning model approach is proposed, which presents high resilience in handling complicated scenes, reaching an average of 98% task completion using primitive objects in a simulation environment. To evaluate the performance of the proposed approach, we performed two extensive sets of experiments in packed objects and a pile of object scenarios with a total of 1000 test runs in simulation. Experimental results showed that the proposed method worked very well in both scenarios and outperformed the recent state-of-the-art approaches. Demo video, trained models, and source code for the results reproducibility purpose are publicly available. https://sites.google.com/view/pushandgrasp/home △ Less

Submitted 16 March, 2024; v1 submitted 4 March, 2022; originally announced March 2022.

Comments: This paper has been accepted for publication at the ICRA2024 conference

arXiv:2109.11544 [pdf, other]

Lifelong 3D Object Recognition and Grasp Synthesis Using Dual Memory Recurrent Self-Organization Networks

Authors: Krishnakumar Santhakumar, Hamidreza Kasaei

Abstract: Humans learn to recognize and manipulate new objects in lifelong settings without forgetting the previously gained knowledge under non-stationary and sequential conditions. In autonomous systems, the agents also need to mitigate similar behavior to continually learn the new object categories and adapt to new environments. In most conventional deep neural networks, this is not possible due to the p… ▽ More Humans learn to recognize and manipulate new objects in lifelong settings without forgetting the previously gained knowledge under non-stationary and sequential conditions. In autonomous systems, the agents also need to mitigate similar behavior to continually learn the new object categories and adapt to new environments. In most conventional deep neural networks, this is not possible due to the problem of catastrophic forgetting, where the newly gained knowledge overwrites existing representations. Furthermore, most state-of-the-art models excel either in recognizing the objects or in grasp prediction, while both tasks use visual input. The combined architecture to tackle both tasks is very limited. In this paper, we proposed a hybrid model architecture consists of a dynamically growing dual-memory recurrent neural network (GDM) and an autoencoder to tackle object recognition and gras** simultaneously. The autoencoder network is responsible to extract a compact representation for a given object, which serves as input for the GDM learning, and is responsible to predict pixel-wise antipodal grasp configurations. The GDM part is designed to recognize the object in both instances and categories levels. We address the problem of catastrophic forgetting using the intrinsic memory replay, where the episodic memory periodically replays the neural activation trajectories in the absence of external sensory information. To extensively evaluate the proposed model in a lifelong setting, we generate a synthetic dataset due to lack of sequential 3D objects dataset. Experiment results demonstrated that the proposed model can learn both object representation and gras** simultaneously in continual learning scenarios. △ Less

Submitted 23 January, 2022; v1 submitted 23 September, 2021; originally announced September 2021.

arXiv:2106.01866 [pdf, other]

Simultaneous Multi-View Object Recognition and Gras** in Open-Ended Domains

Authors: Hamidreza Kasaei, Sha Luo, Remo Sasso, Mohammadreza Kasaei

Abstract: To aid humans in everyday tasks, robots need to know which objects exist in the scene, where they are, and how to grasp and manipulate them in different situations. Therefore, object recognition and gras** are two key functionalities for autonomous robots. Most state-of-the-art approaches treat object recognition and gras** as two separate problems, even though both use visual input. Furthermo… ▽ More To aid humans in everyday tasks, robots need to know which objects exist in the scene, where they are, and how to grasp and manipulate them in different situations. Therefore, object recognition and gras** are two key functionalities for autonomous robots. Most state-of-the-art approaches treat object recognition and gras** as two separate problems, even though both use visual input. Furthermore, the knowledge of the robot is fixed after the training phase. In such cases, if the robot encounters new object categories, it must be retrained to incorporate new information without catastrophic forgetting. In order to resolve this problem, we propose a deep learning architecture with an augmented memory capacity to handle open-ended object recognition and gras** simultaneously. In particular, our approach takes multi-views of an object as input and jointly estimates pixel-wise grasp configuration as well as a deep scale- and rotation-invariant representation as output. The obtained representation is then used for open-ended object recognition through a meta-active learning technique. We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories using very few examples on-site in both simulation and real-world settings. A video of these experiments is available online at: https://youtu.be/n9SMpuEkOgk △ Less

Submitted 6 December, 2022; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: arXiv admin note: text overlap with arXiv:2103.10997

arXiv:2103.13834 [pdf, other]

Self-Imitation Learning by Planning

Authors: Sha Luo, Hamidreza Kasaei, Lambert Schomaker

Abstract: Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, which is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. I… ▽ More Imitation learning (IL) enables robots to acquire skills quickly by transferring expert knowledge, which is widely adopted in reinforcement learning (RL) to initialize exploration. However, in long-horizon motion planning tasks, a challenging problem in deploying IL and RL methods is how to generate and collect massive, broadly distributed data such that these methods can generalize effectively. In this work, we solve this problem using our proposed approach called {self-imitation learning by planning (SILP)}, where demonstration data are collected automatically by planning on the visited states from the current policy. SILP is inspired by the observation that successfully visited states in the early reinforcement learning stage are collision-free nodes in the graph-search based motion planner, so we can plan and relabel robot's own trials as demonstrations for policy learning. Due to these self-generated demonstrations, we relieve the human operator from the laborious data preparation process required by IL and RL methods in solving complex motion planning tasks. The evaluation results show that our SILP method achieves higher success rates and enhances sample efficiency compared to selected baselines, and the policy learned in simulation performs well in a real-world placement task with changing goals and obstacles. △ Less

Submitted 26 March, 2021; v1 submitted 25 March, 2021; originally announced March 2021.

arXiv:2103.10997 [pdf, other]

MVGrasp: Real-Time Multi-View 3D Object Gras** in Highly Cluttered Environments

Authors: Hamidreza Kasaei, Mohammadreza Kasaei

Abstract: Nowadays robots play an increasingly important role in our daily life. In human-centered environments, robots often encounter piles of objects, packed items, or isolated objects. Therefore, a robot must be able to grasp and manipulate different objects in various situations to help humans with daily tasks. In this paper, we propose a multi-view deep learning approach to handle robust object graspi… ▽ More Nowadays robots play an increasingly important role in our daily life. In human-centered environments, robots often encounter piles of objects, packed items, or isolated objects. Therefore, a robot must be able to grasp and manipulate different objects in various situations to help humans with daily tasks. In this paper, we propose a multi-view deep learning approach to handle robust object gras** in human-centric domains. In particular, our approach takes a point cloud of an arbitrary object as an input, and then, generates orthographic views of the given object. The obtained views are finally used to estimate pixel-wise grasp synthesis for each object. We train the model end-to-end using a small object grasp dataset and test it on both simulations and real-world data without any further fine-tuning. To evaluate the performance of the proposed approach, we performed extensive sets of experiments in three scenarios, including isolated objects, packed items, and pile of objects. Experimental results show that our approach performed very well in all simulation and real-robot scenarios, and is able to achieve reliable closed-loop gras** of novel objects across various scene configurations. △ Less

Submitted 5 October, 2022; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: The video of our experiments can be found here: https://youtu.be/c-4lzjbF7fY

arXiv:2103.09863 [pdf, other]

MORE: Simultaneous Multi-View 3D Object Recognition and Pose Estimation

Authors: Tommaso Parisotto, Subhaditya Mukherjee, Hamidreza Kasaei

Abstract: Simultaneous object recognition and pose estimation are two key functionalities for robots to safely interact with humans as well as environments. Although both object recognition and pose estimation use visual input, most state-of-the-art tackles them as two separate problems since the former needs a view-invariant representation while object pose estimation necessitates a view-dependent descript… ▽ More Simultaneous object recognition and pose estimation are two key functionalities for robots to safely interact with humans as well as environments. Although both object recognition and pose estimation use visual input, most state-of-the-art tackles them as two separate problems since the former needs a view-invariant representation while object pose estimation necessitates a view-dependent description. Nowadays, multi-view Convolutional Neural Network (MVCNN) approaches show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation methods, and even less on addressing these two problems simultaneously. The pose of virtual cameras in MVCNN methods is often predefined in advance, leading to bound the application of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep object-agnostic entropy estimation model, capable of predicting the best viewpoints of a given 3D object. The obtained views of the object are then fed to the network to simultaneously predict the pose and category label of the target object. Experimental results showed that the views obtained from such positions are descriptive enough to achieve a good accuracy score. Furthermore, we designed a real-life serve drink scenario to demonstrate how well the proposed approach worked in real robot tasks. Code is available online at: github.com/SubhadityaMukherjee/more_mvcnn △ Less

Submitted 7 April, 2023; v1 submitted 17 March, 2021; originally announced March 2021.

arXiv:2103.09720 [pdf, other]

Few-Shot Visual Grounding for Natural Human-Robot Interaction

Authors: Giorgos Tziafas, Hamidreza Kasaei

Abstract: Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a h… ▽ More Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods that tackle the challenge using pre-trained object detectors via a two-stepped process, we develop a single stage zero-shot model that is able to provide predictions in unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results showed that the proposed model performs well in terms of accuracy and speed, while showcasing robustness to variation in the natural language input. △ Less

Submitted 31 March, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

Comments: 6 pages, 4 figures, ICARSC2021 accepted

arXiv:2103.05857 [pdf, ps, other]

Fast block-coordinate Frank-Wolfe algorithm for semi-relaxed optimal transport

Authors: Takumi Fukunaga, Hiroyuki Kasai

Abstract: Optimal transport (OT), which provides a distance between two probability distributions by considering their spatial locations, has been applied to widely diverse applications. Computing an OT problem requires solution of linear programming with tight mass-conservation constraints. This requirement hinders its application to large-scale problems. To alleviate this issue, the recently proposed rela… ▽ More Optimal transport (OT), which provides a distance between two probability distributions by considering their spatial locations, has been applied to widely diverse applications. Computing an OT problem requires solution of linear programming with tight mass-conservation constraints. This requirement hinders its application to large-scale problems. To alleviate this issue, the recently proposed relaxed-OT approach uses a faster algorithm by relaxing such constraints. Its effectiveness for practical applications has been demonstrated. Nevertheless, it still exhibits slow convergence. To this end, addressing a convex semi-relaxed OT, we propose a fast block-coordinate Frank-Wolfe (BCFW) algorithm, which gives sparse solutions. Specifically, we provide their upper bounds of the worst convergence iterations, and equivalence between the linearization duality gap and the Lagrangian duality gap. Three fast variants of the proposed BCFW are also proposed. Numerical evaluations in color transfer problem demonstrate that the proposed algorithms outperform state-of-the-art algorithms across different settings. △ Less

Submitted 9 March, 2021; originally announced March 2021.

arXiv:2103.00902 [pdf, other]

Manifold optimization for non-linear optimal transport problems

Authors: Bamdev Mishra, N T V Satyadev, Hiroyuki Kasai, Pratik Jawanpuria

Abstract: Optimal transport (OT) has recently found widespread interest in machine learning. It allows to define novel distances between probability measures, which have shown promise in several applications. In this work, we discuss how to computationally approach general non-linear OT problems within the framework of Riemannian manifold optimization. The basis of this is the manifold of doubly stochastic… ▽ More Optimal transport (OT) has recently found widespread interest in machine learning. It allows to define novel distances between probability measures, which have shown promise in several applications. In this work, we discuss how to computationally approach general non-linear OT problems within the framework of Riemannian manifold optimization. The basis of this is the manifold of doubly stochastic matrices (and their generalization). Even though the manifold geometry is not new, surprisingly, its usefulness for solving general non-linear OT problems has not been popular. To this end, we specifically discuss optimization-related ingredients that allow modeling the OT problem on smooth Riemannian manifolds by exploiting the geometry of the search space. We also discuss extensions where we reuse the developed optimization ingredients. We make available the Manifold optimization-based Optimal Transport, or MOT, repository with codes useful in solving OT problems in Python and Matlab. The codes are available at \url{https://github.com/SatyadevNtv/MOT}. △ Less

Submitted 8 October, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: technical report, change is title, addition of experiments

arXiv:2012.03612 [pdf, ps, other]

doi 10.1016/j.sigpro.2021.108281

LCS Graph Kernel Based on Wasserstein Distance in Longest Common Subsequence Metric Space

Authors: Jianming Huang, Zhongxi Fang, Hiroyuki Kasai

Abstract: For graph learning tasks, many existing methods utilize a message-passing mechanism where vertex features are updated iteratively by aggregation of neighbor information. This strategy provides an efficient means for graph features extraction, but obtained features after many iterations might contain too much information from other vertices, and tend to be similar to each other. This makes their re… ▽ More For graph learning tasks, many existing methods utilize a message-passing mechanism where vertex features are updated iteratively by aggregation of neighbor information. This strategy provides an efficient means for graph features extraction, but obtained features after many iterations might contain too much information from other vertices, and tend to be similar to each other. This makes their representations less expressive. Learning graphs using paths, on the other hand, can be less adversely affected by this problem because it does not involve all vertex neighbors. However, most of them can only compare paths with the same length, which might engender information loss. To resolve this difficulty, we propose a new Graph Kernel based on a Longest Common Subsequence (LCS) similarity. Moreover, we found that the widely-used R-convolution framework is unsuitable for path-based Graph Kernel because a huge number of comparisons between dissimilar paths might deteriorate graph distances calculation. Therefore, we propose a novel metric space by exploiting the proposed LCS-based similarity, and compute a new Wasserstein-based graph distance in this metric space, which emphasizes more the comparison between similar paths. Furthermore, to reduce the computational cost, we propose an adjacent point merging operation to sparsify point clouds in the metric space. △ Less

Submitted 29 October, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Journal ref: Signal Processing, Vol.189, 2021

arXiv:2011.12542 [pdf, ps, other]

Wasserstein k-means with sparse simplex projection

Authors: Takumi Fukunaga, Hiroyuki Kasai

Abstract: This paper presents a proposal of a faster Wasserstein $k$-means algorithm for histogram data by reducing Wasserstein distance computations and exploiting sparse simplex projection. We shrink data samples, centroids, and the ground cost matrix, which leads to considerable reduction of the computations used to solve optimal transport problems without loss of clustering quality. Furthermore, we dyna… ▽ More This paper presents a proposal of a faster Wasserstein $k$-means algorithm for histogram data by reducing Wasserstein distance computations and exploiting sparse simplex projection. We shrink data samples, centroids, and the ground cost matrix, which leads to considerable reduction of the computations used to solve optimal transport problems without loss of clustering quality. Furthermore, we dynamically reduced the computational complexity by removing lower-valued data samples and harnessing sparse simplex projection while kee** the degradation of clustering quality lower. We designate this proposed algorithm as sparse simplex projection based Wasserstein $k$-means, or SSPW $k$-means. Numerical evaluations conducted with comparison to results obtained using Wasserstein $k$-means algorithm demonstrate the effectiveness of the proposed SSPW $k$-means for real-world datasets △ Less

Submitted 25 November, 2020; originally announced November 2020.

Comments: Accepted in ICPR2020

arXiv:2011.12532 [pdf, ps, other]

Consistency-aware and Inconsistency-aware Graph-based Multi-view Clustering

Authors: Mitsuhiko Horie, Hiroyuki Kasai

Abstract: Multi-view data analysis has gained increasing popularity because multi-view data are frequently encountered in machine learning applications. A simple but promising approach for clustering of multi-view data is multi-view clustering (MVC), which has been developed extensively to classify given subjects into some clustered groups by learning latent common features that are shared across multi-view… ▽ More Multi-view data analysis has gained increasing popularity because multi-view data are frequently encountered in machine learning applications. A simple but promising approach for clustering of multi-view data is multi-view clustering (MVC), which has been developed extensively to classify given subjects into some clustered groups by learning latent common features that are shared across multi-view data. Among existing approaches, graph-based multi-view clustering (GMVC) achieves state-of-the-art performance by leveraging a shared graph matrix called the unified matrix. However, existing methods including GMVC do not explicitly address inconsistent parts of input graph matrices. Consequently, they are adversely affected by unacceptable clustering performance. To this end, this paper proposes a new GMVC method that incorporates consistent and inconsistent parts lying across multiple views. This proposal is designated as CI-GMVC. Numerical evaluations of real-world datasets demonstrate the effectiveness of the proposed CI-GMVC. △ Less

Submitted 25 November, 2020; originally announced November 2020.

Comments: Accepted in EUSIPCO2020

arXiv:2010.14773 [pdf, ps, other]

Graph embedding using multi-layer adjacent point merging model

Authors: Jianming Huang, Hiroyuki Kasai

Abstract: For graph classification tasks, many traditional kernel methods focus on measuring the similarity between graphs. These methods have achieved great success on resolving graph isomorphism problems. However, in some classification problems, the graph class depends on not only the topological similarity of the whole graph, but also constituent subgraph patterns. To this end, we propose a novel graph… ▽ More For graph classification tasks, many traditional kernel methods focus on measuring the similarity between graphs. These methods have achieved great success on resolving graph isomorphism problems. However, in some classification problems, the graph class depends on not only the topological similarity of the whole graph, but also constituent subgraph patterns. To this end, we propose a novel graph embedding method using a multi-layer adjacent point merging model. This embedding method allows us to extract different subgraph patterns from train-data. Then we present a flexible loss function for feature selection which enhances the robustness of our method for different classification problems. Finally, numerical evaluations demonstrate that our proposed method outperforms many state-of-the-art methods. △ Less

Submitted 17 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021). arXiv admin note: text overlap with arXiv:2012.03612

arXiv:2009.09235 [pdf, other]

Open-Ended Fine-Grained 3D Object Categorization by Combining Shape and Texture Features in Multiple Colorspaces

Authors: Nils Keunecke, S. Hamidreza Kasaei

Abstract: As a consequence of an ever-increasing number of service robots, there is a growing demand for highly accurate real-time 3D object recognition. Considering the expansion of robot applications in more complex and dynamic environments,it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the functionality to… ▽ More As a consequence of an ever-increasing number of service robots, there is a growing demand for highly accurate real-time 3D object recognition. Considering the expansion of robot applications in more complex and dynamic environments,it is evident that it is not possible to pre-program all object categories and anticipate all exceptions in advance. Therefore, robots should have the functionality to learn about new object categories in an open-ended fashion while working in the environment.Towards this goal, we propose a deep transfer learning approach to generate a scale- and pose-invariant object representation by considering shape and texture information in multiple colorspaces. The obtained global object representation is then fed to an instance-based object category learning and recognition,where a non-expert human user exists in the learning loop and can interactively guide the process of experience acquisition by teaching new object categories, or by correcting insufficient or erroneous categories. In this work, shape information encodes the common patterns of all categories, while texture information is used to describes the appearance of each instance in detail.Multiple color space combinations and network architectures are evaluated to find the most descriptive system. Experimental results showed that the proposed network architecture out-performed the selected state-of-the-art approaches in terms of object classification accuracy and scalability. Furthermore, we performed a real robot experiment in the context of serve-a-beer scenario to show the real-time performance of the proposed approach. △ Less

Submitted 28 May, 2021; v1 submitted 19 September, 2020; originally announced September 2020.

arXiv:2009.07213 [pdf, other]

3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks

Authors: Sudhakaran Jain, Hamidreza Kasaei

Abstract: Service robots, in general, have to work independently and adapt to the dynamic changes happening in the environment in real-time. One important aspect in such scenarios is to continually learn to recognize newer object categories when they become available. This combines two main research problems namely continual learning and 3D object recognition. Most of the existing research approaches includ… ▽ More Service robots, in general, have to work independently and adapt to the dynamic changes happening in the environment in real-time. One important aspect in such scenarios is to continually learn to recognize newer object categories when they become available. This combines two main research problems namely continual learning and 3D object recognition. Most of the existing research approaches include the use of deep Convolutional Neural Networks (CNNs) focusing on image datasets. A modified approach might be needed for continually learning 3D object categories. A major concern in using CNNs is the problem of catastrophic forgetting when a model tries to learn a new task. Despite various proposed solutions to mitigate this problem, there still exist some downsides of such solutions, e.g., computational complexity, especially when learning substantial number of tasks. These downsides can pose major problems in robotic scenarios where real-time response plays an essential role. Towards addressing this challenge, we propose a new deep transfer learning approach based on a dynamic architectural method to make robots capable of open-ended learning about new 3D object categories. Furthermore, we make sure that the mentioned downsides are minimized to a great extent. Experimental results showed that the proposed model outperformed state-of-the-art approaches with regards to accuracy and also substantially minimizes computational overhead. △ Less

Submitted 15 March, 2021; v1 submitted 15 September, 2020; originally announced September 2020.

arXiv:2009.01152 [pdf, other]

Local-HDP: Interactive Open-Ended 3D Object Categorization in Real-Time Robotic Scenarios

Authors: H. Ayoobi, H. Kasaei, M. Cao, R. Verbrugge, B. Verheij

Abstract: We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment in time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features to h… ▽ More We introduce a non-parametric hierarchical Bayesian approach for open-ended 3D object categorization, named the Local Hierarchical Dirichlet Process (Local-HDP). This method allows an agent to learn independent topics for each category incrementally and to adapt to the environment in time. Hierarchical Bayesian approaches like Latent Dirichlet Allocation (LDA) can transform low-level features to high-level conceptual topics for 3D object categorization. However, the efficiency and accuracy of LDA-based approaches depend on the number of topics that is chosen manually. Moreover, fixing the number of topics for all categories can lead to overfitting or underfitting of the model. In contrast, the proposed Local-HDP can autonomously determine the number of topics for each category. Furthermore, the online variational inference method has been adapted for fast posterior approximation in the Local-HDP model. Experiments show that the proposed Local-HDP method outperforms other state-of-the-art approaches in terms of accuracy, scalability, and memory efficiency by a large margin. Moreover, two robotic experiments have been conducted to show the applicability of the proposed approach in real-time applications. △ Less

Submitted 11 April, 2021; v1 submitted 2 September, 2020; originally announced September 2020.

Comments: 13 pages

arXiv:2003.08151 [pdf, other]

The State of Lifelong Learning in Service Robots: Current Bottlenecks in Object Perception and Manipulation

Authors: S. Hamidreza Kasaei, Jorik Melsen, Floris van Beers, Christiaan Steenkist, Klemen Voncina

Abstract: Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state-of-the-art continues to improve to make a proper coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of t… ▽ More Service robots are appearing more and more in our daily life. The development of service robots combines multiple fields of research, from object perception to object manipulation. The state-of-the-art continues to improve to make a proper coupling between object perception and manipulation. This coupling is necessary for service robots not only to perform various tasks in a reasonable amount of time but also to continually adapt to new environments and safely interact with non-expert human users. Nowadays, robots are able to recognize various objects, and quickly plan a collision-free trajectory to grasp a target object in predefined settings. Besides, in most of the cases, there is a reliance on large amounts of training data. Therefore, the knowledge of such robots is fixed after the training phase, and any changes in the environment require complicated, time-consuming, and expensive robot re-programming by human experts. Therefore, these approaches are still too rigid for real-life applications in unstructured environments, where a significant portion of the environment is unknown and cannot be directly sensed or controlled. In such environments, no matter how extensive the training data used for batch learning, a robot will always face new objects. Therefore, apart from batch learning, the robot should be able to continually learn about new object categories and grasp affordances from very few training examples on-site. Moreover, apart from robot self-learning, non-expert users could interactively guide the process of experience acquisition by teaching new concepts, or by correcting insufficient or erroneous concepts. In this way, the robot will constantly learn how to help humans in everyday tasks by gaining more and more experiences without the need for re-programming. △ Less

Submitted 6 May, 2021; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:2002.03892 [pdf, other]

Learning to Grasp 3D Objects using Deep Residual U-Nets

Authors: Yikun Li, Lambert Schomaker, S. Hamidreza Kasaei

Abstract: Grasp synthesis is one of the challenging tasks for any robot object manipulation task. In this paper, we present a new deep learning-based grasp synthesis approach for 3D objects. In particular, we propose an end-to-end 3D Convolutional Neural Network to predict the objects' graspable areas. We named our approach Res-U-Net since the architecture of the network is designed based on U-Net structure… ▽ More Grasp synthesis is one of the challenging tasks for any robot object manipulation task. In this paper, we present a new deep learning-based grasp synthesis approach for 3D objects. In particular, we propose an end-to-end 3D Convolutional Neural Network to predict the objects' graspable areas. We named our approach Res-U-Net since the architecture of the network is designed based on U-Net structure and residual network-styled blocks. It devised to plan 6-DOF grasps for any desired object, be efficient to compute and use, and be robust against varying point cloud density and Gaussian noise. We have performed extensive experiments to assess the performance of the proposed approach concerning graspable part detection, grasp success rate, and robustness to varying point cloud density and Gaussian noise. Experiments validate the promising performance of the proposed architecture in all aspects. A video showing the performance of our approach in the simulation environment can be found at: http://youtu.be/5_yAJCc8owo △ Less

Submitted 12 September, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2002.03779 [pdf, other]

Investigating the Importance of Shape Features, Color Constancy, Color Spaces and Similarity Measures in Open-Ended 3D Object Recognition

Authors: S. Hamidreza Kasaei, Maryam Ghorbani, Jits Schilperoort, Wessel van der Rest

Abstract: Despite the recent success of state-of-the-art 3D object recognition approaches, service robots are frequently failed to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time response under changing and unpredictable environmental conditions. Most of the recent approaches use either th… ▽ More Despite the recent success of state-of-the-art 3D object recognition approaches, service robots are frequently failed to recognize many objects in real human-centric environments. For these robots, object recognition is a challenging task due to the high demand for accurate and real-time response under changing and unpredictable environmental conditions. Most of the recent approaches use either the shape information only and ignore the role of color information or vice versa. Furthermore, they mainly utilize the $L_n$ Minkowski family functions to measure the similarity of two object views, while there are various distance measures that are applicable to compare two object views. In this paper, we explore the importance of shape information, color constancy, color spaces, and various similarity measures in open-ended 3D object recognition. Towards this goal, we extensively evaluate the performance of object recognition approaches in three different configurations, including \textit{color-only}, \textit{shape-only}, and \textit{ combinations of color and shape}, in both offline and online settings. Experimental results concerning scalability, memory usage, and object recognition performance show that all of the \textit{combinations of color and shape} yields significant improvements over the \textit{shape-only} and \textit{color-only} approaches. The underlying reason is that color information is an important feature to distinguish objects that have very similar geometric properties with different colors and vice versa. Moreover, by combining color and shape information, we demonstrate that the robot can learn new object categories from very few training examples in a real-world setting. △ Less

Submitted 26 September, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2002.02697 [pdf, other]

doi 10.1109/IJCNN48605.2020.9207427

Accelerating Reinforcement Learning for Reaching using Continuous Curriculum Learning

Authors: Sha Luo, Hamidreza Kasaei, Lambert Schomaker

Abstract: Reinforcement learning has shown great promise in the training of robot behavior due to the sequential decision making characteristics. However, the required enormous amount of interactive and informative training data provides the major stumbling block for progress. In this study, we focus on accelerating reinforcement learning (RL) training and improving the performance of multi-goal reaching ta… ▽ More Reinforcement learning has shown great promise in the training of robot behavior due to the sequential decision making characteristics. However, the required enormous amount of interactive and informative training data provides the major stumbling block for progress. In this study, we focus on accelerating reinforcement learning (RL) training and improving the performance of multi-goal reaching tasks. Specifically, we propose a precision-based continuous curriculum learning (PCCL) method in which the requirements are gradually adjusted during the training process, instead of fixing the parameter in a static schedule. To this end, we explore various continuous curriculum strategies for controlling a training process. This approach is tested using a Universal Robot 5e in both simulation and real-world multi-goal reach experiments. Experimental results support the hypothesis that a static training schedule is suboptimal, and using an appropriate decay function for curriculum learning provides superior results in a faster way. △ Less

Submitted 21 December, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

arXiv:1912.09539 [pdf, other]

Interactive Open-Ended Learning for 3D Object Recognition

Authors: S. Hamidreza Kasaei

Abstract: The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates the adaptation to new env… ▽ More The thesis contributes in several important ways to the research area of 3D object category learning and recognition. To cope with the mentioned limitations, we look at human cognition, in particular at the fact that human beings learn to recognize object categories ceaselessly over time. This ability to refine knowledge from the set of accumulated experiences facilitates the adaptation to new environments. Inspired by this capability, we seek to create a cognitive object perception and perceptual learning architecture that can learn 3D object categories in an open-ended fashion. In this context, ``open-ended'' implies that the set of categories to be learned is not known in advance, and the training instances are extracted from actual experiences of a robot, and thus become gradually available, rather than being available since the beginning of the learning process. In particular, this architecture provides perception capabilities that will allow robots to incrementally learn object categories from the set of accumulated experiences and reason about how to perform complex tasks. This framework integrates detection, tracking, teaching, learning, and recognition of objects. An extensive set of systematic experiments, in multiple experimental settings, was carried out to thoroughly evaluate the described learning approaches. Experimental results show that the proposed system is able to interact with human users, learn new object categories over time, as well as perform complex tasks. The contributions presented in this thesis have been fully implemented and evaluated on different standard object and scene datasets and empirically evaluated on different robotic platforms. △ Less

Submitted 19 December, 2019; originally announced December 2019.

Comments: PhD thesis

arXiv:1907.12924 [pdf, other]

Look Further to Recognize Better: Learning Shared Topics and Category-Specific Dictionaries for Open-Ended 3D Object Recognition

Authors: S. Hamidreza Kasaei

Abstract: Service robots are expected to operate effectively in human-centric environments for long periods of time. In such realistic scenarios, fine-grained object categorization is as important as basic-level object categorization. We tackle this problem by proposing an open-ended object recognition approach which concurrently learns both the object categories and the local features for encoding objects.… ▽ More Service robots are expected to operate effectively in human-centric environments for long periods of time. In such realistic scenarios, fine-grained object categorization is as important as basic-level object categorization. We tackle this problem by proposing an open-ended object recognition approach which concurrently learns both the object categories and the local features for encoding objects. In this work, each object is represented using a set of general latent visual topics and category-specific dictionaries. The general topics encode the common patterns of all categories, while the category-specific dictionary describes the content of each category in details. The proposed approach discovers both sets of general and specific representations in an unsupervised fashion and updates them incrementally using new object views. Experimental results show that our approach yields significant improvements over the previous state-of-the-art approaches concerning scalability and object classification performance. Moreover, our approach demonstrates the capability of learning from very few training examples in a real-world setting. Regarding computation time, the best result was obtained with a Bag-of-Words method followed by a variant of the Latent Dirichlet Allocation approach. △ Less

Submitted 26 July, 2019; originally announced July 2019.

Comments: arXiv admin note: text overlap with arXiv:1902.03057

arXiv:1907.10932 [pdf, other]

Object Perception and Gras** in Open-Ended Domains

Authors: S. Hamidreza Kasaei

Abstract: Nowadays service robots are leaving the structured and completely known environments and entering human-centric settings. For these robots, object perception and gras** are two challenging tasks due to the high demand for accurate and real-time responses. Although many problems have already been understood and solved successfully, many challenges still remain. Open-ended learning is one of these… ▽ More Nowadays service robots are leaving the structured and completely known environments and entering human-centric settings. For these robots, object perception and gras** are two challenging tasks due to the high demand for accurate and real-time responses. Although many problems have already been understood and solved successfully, many challenges still remain. Open-ended learning is one of these challenges waiting for many improvements. Cognitive science revealed that humans learn to recognize object categories and grasp affordances ceaselessly over time. This ability allows adapting to new environments by enhancing their knowledge from the accumulation of experiences and the conceptualization of new object categories. Inspired by this, an autonomous robot must have the ability to process visual information and conduct learning and recognition tasks in an open-ended fashion. In this context, "open-ended" implies that the set of object categories to be learned is not known in advance, and the training instances are extracted from online experiences of a robot, and become gradually available over time, rather than being completely available at the beginning of the learning process. In my research, I mainly focus on interactive open-ended learning approaches to recognize multiple objects and their grasp affordances concurrently. In particular, I try to address the following research questions: (i) What is the importance of open-ended learning for autonomous robots? (ii) How robots could learn incrementally from their own experiences as well as from interaction with humans? (iii) What are the limitations of Deep Learning approaches to be used in an open-ended manner? (iv) How to evaluate open-ended learning approaches and what are the right metrics to do so? △ Less

Submitted 25 July, 2019; originally announced July 2019.

arXiv:1906.10436 [pdf, other]

Riemannian optimization on the simplex of positive definite matrices

Authors: Bamdev Mishra, Hiroyuki Kasai, Pratik Jawanpuria

Abstract: In this work, we generalize the probability simplex constraint to matrices, i.e., $\mathbf{X}_1 + \mathbf{X}_2 + \ldots + \mathbf{X}_K = \mathbf{I}$, where $\mathbf{X}_i \succeq 0$ is a symmetric positive semidefinite matrix of size $n\times n$ for all $i = \{1,\ldots,K \}$. By assuming positive definiteness of the matrices, we show that the constraint set arising from the matrix simplex has the s… ▽ More In this work, we generalize the probability simplex constraint to matrices, i.e., $\mathbf{X}_1 + \mathbf{X}_2 + \ldots + \mathbf{X}_K = \mathbf{I}$, where $\mathbf{X}_i \succeq 0$ is a symmetric positive semidefinite matrix of size $n\times n$ for all $i = \{1,\ldots,K \}$. By assuming positive definiteness of the matrices, we show that the constraint set arising from the matrix simplex has the structure of a smooth Riemannian submanifold. We discuss a novel Riemannian geometry for the matrix simplex manifold and show the derivation of first- and second-order optimization related ingredients. △ Less

Submitted 17 November, 2020; v1 submitted 25 June, 2019; originally announced June 2019.

Comments: 12th OPT Workshop on Optimization for Machine Learning at NeurIPS 2020

Showing 1–50 of 69 results for author: Kasaei, H