Search | arXiv e-print repository

Revisiting Constant Negative Rewards for Goal-Reaching Tasks in Robot Learning

Authors: Gautham Vasan, Yan Wang, Fahim Shahriar, James Bergstra, Martin Jagersand, A. Rupam Mahmood

Abstract: Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as a problem of reaching a goal state as soon as possible. These problems, when formulated as episodic reinforcement learning tasks, can easily be specified to align well with our intended goal: -1 reward every time step with termination upon reaching the goal state, called minimum-time tasks.… ▽ More Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as a problem of reaching a goal state as soon as possible. These problems, when formulated as episodic reinforcement learning tasks, can easily be specified to align well with our intended goal: -1 reward every time step with termination upon reaching the goal state, called minimum-time tasks. Despite this simplicity, such formulations are often overlooked in favor of dense rewards due to their perceived difficulty and lack of informativeness. Our studies contrast the two reward paradigms, revealing that the minimum-time task specification not only facilitates learning higher-quality policies but can also surpass dense-reward-based policies on their own performance metrics. Crucially, we also identify the goal-hit rate of the initial policy as a robust early indicator for learning success in such sparse feedback settings. Finally, using four distinct real-robotic platforms, we show that it is possible to learn pixel-based policies from scratch within two to three hours using constant negative rewards. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Comments: In Proceedings of Reinforcement Learning Conference 2024. For video demo, see https://drive.google.com/file/d/1O8D3oCWq5xf2hi1JOlMBbs6W1ClrvUFb/view?usp=sharing

arXiv:2310.03932 [pdf, other]

Bridging Low-level Geometry to High-level Concepts in Visual Servoing of Robot Manipulation Task Using Event Knowledge Graphs and Vision-Language Models

Authors: Chen Jiang, Martin Jagersand

Abstract: In this paper, we propose a framework of building knowledgeable robot control in the scope of smart human-robot interaction, by empowering a basic uncalibrated visual servoing controller with contextual knowledge through the joint usage of event knowledge graphs (EKGs) and large-scale pretrained vision-language models (VLMs). The framework is expanded in twofold: first, we interpret low-level imag… ▽ More In this paper, we propose a framework of building knowledgeable robot control in the scope of smart human-robot interaction, by empowering a basic uncalibrated visual servoing controller with contextual knowledge through the joint usage of event knowledge graphs (EKGs) and large-scale pretrained vision-language models (VLMs). The framework is expanded in twofold: first, we interpret low-level image geometry as high-level concepts, allowing us to prompt VLMs and to select geometric features of points and lines for motor control skills; then, we create an event knowledge graph (EKG) to conceptualize a robot manipulation task of interest, where the main body of the EKG is characterized by an executable behavior tree, and the leaves by semantic concepts relevant to the manipulation context. We demonstrate, in an uncalibrated environment with real robot trials, that our method lowers the reliance of human annotation during task interfacing, allows the robot to perform activities of daily living more easily by treating low-level geometric-based motor control skills as high-level concepts, and is beneficial in building cognitive thinking for smart robot applications. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.09183 [pdf, other]

CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation

Authors: Chen Jiang, Yuchen Yang, Martin Jagersand

Abstract: The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmenta… ▽ More The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, which is a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr - a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP's strong vision-language representations to segment regions from referring expressions, while utilizing its ``U-shaped'' encoder-decoder architecture to generate predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and can successfully assist real-world UIBVS control in an unstructured manipulation environment. △ Less

Submitted 17 September, 2023; originally announced September 2023.

arXiv:2307.05141 [pdf, other]

Deep Probabilistic Movement Primitives with a Bayesian Aggregator

Authors: Michael Przystupa, Faezeh Haghverd, Martin Jagersand, Samuele Tosatto

Abstract: Movement primitives are trainable parametric models that reproduce robotic movements starting from a limited set of demonstrations. Previous works proposed simple linear models that exhibited high sample efficiency and generalization power by allowing temporal modulation of movements (reproducing movements faster or slower), blending (merging two movements into one), via-point conditioning (constr… ▽ More Movement primitives are trainable parametric models that reproduce robotic movements starting from a limited set of demonstrations. Previous works proposed simple linear models that exhibited high sample efficiency and generalization power by allowing temporal modulation of movements (reproducing movements faster or slower), blending (merging two movements into one), via-point conditioning (constraining a movement to meet some particular via-points) and context conditioning (generation of movements based on an observed variable, e.g., position of an object). Previous works have proposed neural network-based motor primitive models, having demonstrated their capacity to perform tasks with some forms of input conditioning or time-modulation representations. However, there has not been a single unified deep motor primitive's model proposed that is capable of all previous operations, limiting neural motor primitive's potential applications. This paper proposes a deep movement primitive architecture that encodes all the operations above and uses a Bayesian context aggregator that allows a more sound context conditioning and blending. Our results demonstrate our approach can scale to reproduce complex motions on a larger variety of input choices compared to baselines while maintaining operations of linear movement primitives provide. △ Less

Submitted 6 June, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

arXiv:2212.08235 [pdf, other]

A Simple Decentralized Cross-Entropy Method

Authors: Zichen Zhang, Jun **, Martin Jagersand, Jun Luo, Dale Schuurmans

Abstract: Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL) where a centralized approach is typically utilized to update the sampling distribution based on only the top-$k$ operation's results on samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue,… ▽ More Cross-Entropy Method (CEM) is commonly used for planning in model-based reinforcement learning (MBRL) where a centralized approach is typically utilized to update the sampling distribution based on only the top-$k$ operation's results on samples. In this paper, we show that such a centralized approach makes CEM vulnerable to local optima, thus impairing its sample efficiency. To tackle this issue, we propose Decentralized CEM (DecentCEM), a simple but effective improvement over classical CEM, by using an ensemble of CEM instances running independently from one another, and each performing a local improvement of its own sampling distribution. We provide both theoretical and empirical analysis to demonstrate the effectiveness of this simple decentralized approach. We empirically show that, compared to the classical centralized approach using either a single or even a mixture of Gaussian distributions, our DecentCEM finds the global optimum much more consistently thus improves the sample efficiency. Furthermore, we plug in our DecentCEM in the planning problem of MBRL, and evaluate our approach in several continuous control environments, with comparison to the state-of-art CEM based MBRL approaches (PETS and POPLIN). Results show sample efficiency improvement by simply replacing the classical CEM module with our DecentCEM module, while only sacrificing a reasonable amount of computational cost. Lastly, we conduct ablation studies for more in-depth analysis. Code is available at https://github.com/vincentzhang/decentCEM △ Less

Submitted 15 December, 2022; originally announced December 2022.

Comments: NeurIPS 2022. The last two authors advised equally

arXiv:2212.04407 [pdf, other]

Dynamic Decision Frequency with Continuous Options

Authors: Amirmohammad Karimi, Jun **, Jun Luo, A. Rupam Mahmood, Martin Jagersand, Samuele Tosatto

Abstract: In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals. The duration between decisions becomes a crucial hyperparameter, as setting it too short may increase the problem's difficulty by requiring the agent to make numerous decisions to achieve its goal while setting it too long can result in the agent losing control over the system. However, physic… ▽ More In classic reinforcement learning algorithms, agents make decisions at discrete and fixed time intervals. The duration between decisions becomes a crucial hyperparameter, as setting it too short may increase the problem's difficulty by requiring the agent to make numerous decisions to achieve its goal while setting it too long can result in the agent losing control over the system. However, physical systems do not necessarily require a constant control frequency, and for learning agents, it is often preferable to operate with a low frequency when possible and a high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), where the agent chooses options as sub-policies of variable durations. These options are time-continuous and can interact with the system at any desired frequency providing a smooth change of actions. We demonstrate the effectiveness of CTCO by comparing its performance to classical RL and temporal-abstraction RL methods on simulated continuous control tasks with various action-cycle times. We show that our algorithm's performance is not affected by the choice of environment interaction frequency. Furthermore, we demonstrate the efficacy of CTCO in facilitating exploration in a real-world visual reaching task for a 7 DOF robotic arm with sparse rewards. △ Less

Submitted 25 October, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: Appears in the Proceedings of the 2023 International Conference on Intelligent Robots and Systems (IROS). Source code at https://github.com/amir-karimi96/continuous-time-continuous-option-policy-gradient.git

arXiv:2202.13604 [pdf, other]

Generalizable task representation learning from human demonstration videos: a geometric approach

Authors: Jun **, Martin Jagersand

Abstract: We study the problem of generalizable task learning from human demonstration videos without extra training on the robot or pre-recorded robot motions. Given a set of human demonstration videos showing a task with different objects/tools (categorical objects), we aim to learn a representation of visual observation that generalizes to categorical objects and enables efficient controller design. We p… ▽ More We study the problem of generalizable task learning from human demonstration videos without extra training on the robot or pre-recorded robot motions. Given a set of human demonstration videos showing a task with different objects/tools (categorical objects), we aim to learn a representation of visual observation that generalizes to categorical objects and enables efficient controller design. We propose to introduce a geometric task structure to the representation learning problem that geometrically encodes the task specification from human demonstration videos, and that enables generalization by building task specification correspondence between categorical objects. Specifically, we propose CoVGS-IL, which uses a graph-structured task function to learn task representations under structural constraints. Our method enables task generalization by selecting geometric features from different objects whose inner connection relationships define the same task in geometric constraints. The learned task representation is then transferred to a robot controller using uncalibrated visual servoing (UVS); thus, the need for extra robot training or pre-recorded robot motions is removed. △ Less

Submitted 28 February, 2022; originally announced February 2022.

Comments: Accepted in ICRA 2022

arXiv:2106.06083 [pdf, other]

Analyzing Neural Jacobian Methods in Applications of Visual Servoing and Kinematic Control

Authors: Michael Przystupa, Masood Dehghan, Martin Jagersand, A. Rupam Mahmood

Abstract: Designing adaptable control laws that can transfer between different robots is a challenge because of kinematic and dynamic differences, as well as in scenarios where external sensors are used. In this work, we empirically investigate a neural networks ability to approximate the Jacobian matrix for an application in Cartesian control schemes. Specifically, we are interested in approximating the ki… ▽ More Designing adaptable control laws that can transfer between different robots is a challenge because of kinematic and dynamic differences, as well as in scenarios where external sensors are used. In this work, we empirically investigate a neural networks ability to approximate the Jacobian matrix for an application in Cartesian control schemes. Specifically, we are interested in approximating the kinematic Jacobian, which arises from kinematic equations map** a manipulator's joint angles to the end-effector's location. We propose two different approaches to learn the kinematic Jacobian. The first method arises from visual servoing where we learn the kinematic Jacobian as an approximate linear system of equations from the k-nearest neighbors for a desired joint configuration. The second, motivated by forward models in machine learning, learns the kinematic behavior directly and calculates the Jacobian by differentiating the learned neural kinematics model. Simulation experimental results show that both methods achieve better performance than alternative data-driven methods for control, provide closer approximations to the proper kinematics Jacobian matrix, and on average produce better-conditioned Jacobian matrices. Real-world experiments were conducted on a Kinova Gen-3 lightweight robotic manipulator, which includes an uncalibrated visual servoing experiment, a practical application of our methods, as well as a 7-DOF point-to-point task highlighting that our methods are applicable on real robotic manipulators. △ Less

Submitted 10 June, 2021; originally announced June 2021.

Comments: 8 pages, 6 Figures, https://www.youtube.com/watch?v=mOMIIBLCL20

arXiv:2105.03533 [pdf, other]

Video Class Agnostic Segmentation with Contrastive Learning for Autonomous Driving

Authors: Mennatullah Siam, Alex Kendall, Martin Jagersand

Abstract: Semantic segmentation in autonomous driving predominantly focuses on learning from large-scale data with a closed set of known classes without considering unknown objects. Motivated by safety reasons, we address the video class agnostic segmentation task, which considers unknown objects outside the closed set of known classes in our training data. We propose a novel auxiliary contrastive loss to l… ▽ More Semantic segmentation in autonomous driving predominantly focuses on learning from large-scale data with a closed set of known classes without considering unknown objects. Motivated by safety reasons, we address the video class agnostic segmentation task, which considers unknown objects outside the closed set of known classes in our training data. We propose a novel auxiliary contrastive loss to learn the segmentation of known classes and unknown objects. Unlike previous work in contrastive learning that samples the anchor, positive and negative examples on an image level, our contrastive learning method leverages pixel-wise semantic and temporal guidance. We conduct experiments on Cityscapes-VPS by withholding four classes from training and show an improvement gain for both known and unknown objects segmentation with the auxiliary contrastive loss. We further release a large-scale synthetic dataset for different autonomous driving scenarios that includes distinct and rare unknown objects. We conduct experiments on the full synthetic dataset and a reduced small-scale version, and show how contrastive learning is more effective in small scale datasets. Our proposed models, dataset, and code will be released at https://github.com/MSiam/video_class_agnostic_segmentation. △ Less

Submitted 10 May, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2104.03892 [pdf, other]

A Quantitative Analysis of Activities of Daily Living: Insights into Improving Functional Independence with Assistive Robotics

Authors: Laura Petrich, Jun **, Masood Dehghan, Martin Jagersand

Abstract: Human assistive robotics have the potential to help the elderly and individuals living with disabilities with their Activities of Daily Living (ADL). Robotics researchers focus on assistive tasks from the perspective of various control schemes and motion types. Health research on the other hand focuses on clinical assessment and rehabilitation, arguably leaving important differences between the tw… ▽ More Human assistive robotics have the potential to help the elderly and individuals living with disabilities with their Activities of Daily Living (ADL). Robotics researchers focus on assistive tasks from the perspective of various control schemes and motion types. Health research on the other hand focuses on clinical assessment and rehabilitation, arguably leaving important differences between the two domains. In particular, little is known quantitatively on which ADLs are typically carried out in a persons everyday environment - at home, work, etc. Understanding what activities are frequently carried out during the day can help guide the development and prioritization of robotic technology for in-home assistive robotic deployment. This study targets several lifelogging databases, where we compute (i) ADL task frequency from long-term low sampling frequency video and Internet of Things (IoT) sensor data, and (ii) short term arm and hand movement data from 30 fps video data of domestic tasks. Robotics and health care communities have differing terms and taxonomies for representing tasks and motions. In this work, we derive and discuss a robotics-relevant taxonomy from quantitative ADL task and motion data in attempt to ameliorate taxonomic differences between the two communities. Our quantitative results provide direction for the development of better assistive robots to support the true demands of the healthcare community. △ Less

Submitted 8 April, 2021; originally announced April 2021.

Comments: Submitted to IROS 2021. arXiv admin note: substantial text overlap with arXiv:2101.02750

arXiv:2103.11015 [pdf, other]

Video Class Agnostic Segmentation Benchmark for Autonomous Driving

Authors: Mennatullah Siam, Alex Kendall, Martin Jagersand

Abstract: Semantic segmentation approaches are typically trained on large-scale data with a closed finite set of known classes without considering unknown objects. In certain safety-critical robotics applications, especially autonomous driving, it is important to segment all objects, including those unknown at training time. We formalize the task of video class agnostic segmentation from monocular video seq… ▽ More Semantic segmentation approaches are typically trained on large-scale data with a closed finite set of known classes without considering unknown objects. In certain safety-critical robotics applications, especially autonomous driving, it is important to segment all objects, including those unknown at training time. We formalize the task of video class agnostic segmentation from monocular video sequences in autonomous driving to account for unknown objects. Video class agnostic segmentation can be formulated as an open-set or a motion segmentation problem. We discuss both formulations and provide datasets and benchmark different baseline approaches for both tracks. In the motion-segmentation track we benchmark real-time joint panoptic and motion instance segmentation, and evaluate the effect of ego-flow suppression. In the open-set segmentation track we evaluate baseline methods that combine appearance, and geometry to learn prototypes per semantic class. We then compare it to a model that uses an auxiliary contrastive loss to improve the discrimination between known and unknown objects. Datasets and models are publicly released at https://msiam.github.io/vca/. △ Less

Submitted 19 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: Accepted in WAD workshop, CVPR 2021

arXiv:2101.04704 [pdf, other]

Boundary-Aware Segmentation Network for Mobile and Web Applications

Authors: Xuebin Qin, Deng-** Fan, Chenyang Huang, Cyril Diagne, Zichen Zhang, Adrià Cabeza Sant'Anna, Albert Suàrez, Martin Jagersand, Ling Shao

Abstract: Although deep models have greatly improved the accuracy and robustness of image segmentation, obtaining segmentation results with highly accurate boundaries and fine structures is still a challenging problem. In this paper, we propose a simple yet powerful Boundary-Aware Segmentation Network (BASNet), which comprises a predict-refine architecture and a hybrid loss, for highly accurate image segmen… ▽ More Although deep models have greatly improved the accuracy and robustness of image segmentation, obtaining segmentation results with highly accurate boundaries and fine structures is still a challenging problem. In this paper, we propose a simple yet powerful Boundary-Aware Segmentation Network (BASNet), which comprises a predict-refine architecture and a hybrid loss, for highly accurate image segmentation. The predict-refine architecture consists of a densely supervised encoder-decoder network and a residual refinement module, which are respectively used to predict and refine a segmentation probability map. The hybrid loss is a combination of the binary cross entropy, structural similarity and intersection-over-union losses, which guide the network to learn three-level (ie, pixel-, patch- and map- level) hierarchy representations. We evaluate our BASNet on two reverse tasks including salient object segmentation, camouflaged object segmentation, showing that it achieves very competitive performance with sharp segmentation boundaries. Importantly, BASNet runs at over 70 fps on a single GPU which benefits many potential real applications. Based on BASNet, we further developed two (close to) commercial applications: AR COPY & PASTE, in which BASNet is integrated with augmented reality for "COPYING" and "PASTING" real-world objects, and OBJECT CUT, which is a web-based tool for automatic object background removal. Both applications have already drawn huge amount of attention and have important real-world impacts. The code and two applications will be publicly available at: https://github.com/NathanUA/BASNet. △ Less

Submitted 11 May, 2021; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: 18 pages, 16 figures, submitted to TPAMI

arXiv:2101.02750 [pdf, other]

Assistive arm and hand manipulation: How does current research intersect with actual healthcare needs?

Authors: Laura Petrich, Jun **, Masood Dehghan, Martin Jagersand

Abstract: Human assistive robotics have the potential to help the elderly and individuals living with disabilities with their Activities of Daily Living (ADL). Robotics researchers present bottom up solutions using various control methods for different types of movements. Health research on the other hand focuses on clinical assessment and rehabilitation leaving arguably important differences between the tw… ▽ More Human assistive robotics have the potential to help the elderly and individuals living with disabilities with their Activities of Daily Living (ADL). Robotics researchers present bottom up solutions using various control methods for different types of movements. Health research on the other hand focuses on clinical assessment and rehabilitation leaving arguably important differences between the two domains. In particular, little is known quantitatively on what ADLs humans perform in their everyday environment - at home, work etc. This information can help guide development and prioritization of robotic technology for in-home assistive robotic deployment. This study targets several lifelogging databases, where we compute (i) ADL task frequency from long-term low sampling frequency video and Internet of Things (IoT) sensor data, and (ii) short term arm and hand movement data from 30 fps video data of domestic tasks. Robotics and health care communities have different terms and taxonomies for representing tasks and motions. We derive and discuss a robotics-relevant taxonomy from this quantitative ADL task and ICF motion data in attempt to ameliorate these taxonomic differences. Our statistics quantify that humans reach, open drawers, doors, and retrieve and use objects hundreds of times a day. Commercial wheelchair mounted robot arms can help 150,000 upper body disabled in the USA alone, but only a few hundred robots are deployed. Better user interfaces, and more capable robots can increase the potential user base and number of ADL tasks solved significantly. △ Less

Submitted 7 January, 2021; originally announced January 2021.

Comments: Submitted to ICRA 2021

arXiv:2011.05857 [pdf, other]

Offline Learning of Counterfactual Predictions for Real-World Robotic Reinforcement Learning

Authors: Jun **, Daniel Graves, Cameron Haigh, Jun Luo, Martin Jagersand

Abstract: We consider real-world reinforcement learning (RL) of robotic manipulation tasks that involve both visuomotor skills and contact-rich skills. We aim to train a policy that maps multimodal sensory observations (vision and force) to a manipulator's joint velocities under practical considerations. We propose to use offline samples to learn a set of general value functions (GVFs) that make counterfact… ▽ More We consider real-world reinforcement learning (RL) of robotic manipulation tasks that involve both visuomotor skills and contact-rich skills. We aim to train a policy that maps multimodal sensory observations (vision and force) to a manipulator's joint velocities under practical considerations. We propose to use offline samples to learn a set of general value functions (GVFs) that make counterfactual predictions from the visual inputs. We show that combining the offline learned counterfactual predictions with force feedbacks in online policy learning allows efficient reinforcement learning given only a terminal (success/failure) reward. We argue that the learned counterfactual predictions form a compact and informative representation that enables sample efficiency and provides auxiliary reward signals that guide online explorations towards contact-rich states. Various experiments in simulation and real-world settings were performed for evaluation. Recordings of the real-world robot training can be found via https://sites.google.com/view/realrl. △ Less

Submitted 25 February, 2022; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: Accepted in ICRA 2022

arXiv:2005.09007 [pdf, other]

doi 10.1016/j.patcog.2020.107404

U$^2$-Net: Going Deeper with Nested U-Structure for Salient Object Detection

Authors: Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, Martin Jagersand

Abstract: In this paper, we design a simple yet powerful deep network architecture, U$^2$-Net, for salient object detection (SOD). The architecture of our U$^2$-Net is a two-level nested U-structure. The design has the following advantages: (1) it is able to capture more contextual information from different scales thanks to the mixture of receptive fields of different sizes in our proposed ReSidual U-block… ▽ More In this paper, we design a simple yet powerful deep network architecture, U$^2$-Net, for salient object detection (SOD). The architecture of our U$^2$-Net is a two-level nested U-structure. The design has the following advantages: (1) it is able to capture more contextual information from different scales thanks to the mixture of receptive fields of different sizes in our proposed ReSidual U-blocks (RSU), (2) it increases the depth of the whole architecture without significantly increasing the computational cost because of the pooling operations used in these RSU blocks. This architecture enables us to train a deep network from scratch without using backbones from image classification tasks. We instantiate two models of the proposed architecture, U$^2$-Net (176.3 MB, 30 FPS on GTX 1080Ti GPU) and U$^2$-Net$^{\dagger}$ (4.7 MB, 40 FPS), to facilitate the usage in different environments. Both models achieve competitive performance on six SOD datasets. The code is available: https://github.com/NathanUA/U-2-Net. △ Less

Submitted 8 March, 2022; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted in Pattern Recognition 2020

arXiv:2003.02768 [pdf, other]

A Geometric Perspective on Visual Imitation Learning

Authors: Jun **, Laura Petrich, Masood Dehghan, Martin Jagersand

Abstract: We consider the problem of visual imitation learning without human supervision (e.g. kinesthetic teaching or teleoperation), nor access to an interactive reinforcement learning (RL) training environment. We present a geometric perspective to derive solutions to this problem. Specifically, we propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task conce… ▽ More We consider the problem of visual imitation learning without human supervision (e.g. kinesthetic teaching or teleoperation), nor access to an interactive reinforcement learning (RL) training environment. We present a geometric perspective to derive solutions to this problem. Specifically, we propose VGS-IL (Visual Geometric Skill Imitation Learning), an end-to-end geometry-parameterized task concept inference method, to infer globally consistent geometric feature association rules from human demonstration video frames. We show that, instead of learning actions from image pixels, learning a geometry-parameterized task concept provides an explainable and invariant representation across demonstrator to imitator under various environmental settings. Moreover, such a task concept representation provides a direct link with geometric vision based controllers (e.g. visual servoing), allowing for efficient map** of high-level task concepts to low-level robot actions. △ Less

Submitted 5 March, 2020; originally announced March 2020.

Comments: submitted to IROS 2020

arXiv:2003.01163 [pdf, other]

Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream

Authors: Chen Jiang, Masood Dehghan, Martin Jagersand

Abstract: Manipulation tasks in daily life, such as pouring water, unfold intentionally under specialized manipulation contexts. Being able to process contextual knowledge in these Activities of Daily Living (ADLs) over time can help us understand manipulation intentions, which are essential for an intelligent robot to transition smoothly between various manipulation actions. In this paper, to model the int… ▽ More Manipulation tasks in daily life, such as pouring water, unfold intentionally under specialized manipulation contexts. Being able to process contextual knowledge in these Activities of Daily Living (ADLs) over time can help us understand manipulation intentions, which are essential for an intelligent robot to transition smoothly between various manipulation actions. In this paper, to model the intended concepts of manipulation, we present a vision dataset under a strictly constrained knowledge domain for both robot and human manipulations, where manipulation concepts and relations are stored by an ontology system in a taxonomic manner. Furthermore, we propose a scheme to generate a combination of visual attentions and an evolving knowledge graph filled with commonsense knowledge. Our scheme works with real-world camera streams and fuses an attention-based Vision-Language model with the ontology system. The experimental results demonstrate that the proposed scheme can successfully represent the evolution of an intended object manipulation procedure for both robots and humans. The proposed scheme allows the robot to mimic human-like intentional behaviors by watching real-time videos. We aim to develop this scheme further for real-world robot intelligence in Human-Robot Interaction. △ Less

Submitted 2 March, 2020; originally announced March 2020.

arXiv:2001.09540 [pdf, other]

Weakly Supervised Few-shot Object Segmentation using Co-Attention with Visual and Semantic Embeddings

Authors: Mennatullah Siam, Naren Doraiswamy, Boris N. Oreshkin, Hengshuai Yao, Martin Jagersand

Abstract: Significant progress has been made recently in develo** few-shot object segmentation methods. Learning is shown to be successful in few-shot segmentation settings, using pixel-level, scribbles and bounding box supervision. This paper takes another approach, i.e., only requiring image-level label for few-shot object segmentation. We propose a novel multi-modal interaction module for few-shot obje… ▽ More Significant progress has been made recently in develo** few-shot object segmentation methods. Learning is shown to be successful in few-shot segmentation settings, using pixel-level, scribbles and bounding box supervision. This paper takes another approach, i.e., only requiring image-level label for few-shot object segmentation. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism using both visual and word embedding. Our model using image-level labels achieves 4.8% improvement over previously proposed image-level few-shot object segmentation. It also outperforms state-of-the-art methods that use weak bounding box supervision on PASCAL-5i. Our results show that few-shot segmentation benefits from utilizing word embeddings, and that we are able to perform few-shot segmentation using stacked joint visual semantic processing with weak image-level labels. We further propose a novel setup, Temporal Object Segmentation for Few-shot Learning (TOSFL) for videos. TOSFL can be used on a variety of public video data such as Youtube-VOS, as demonstrated in both instance-level and category-level TOSFL experiments. △ Less

Submitted 17 May, 2020; v1 submitted 26 January, 2020; originally announced January 2020.

Comments: Accepted to IJCAI'20. The first three authors listed contributed equally

arXiv:1912.08936 [pdf, other]

One-Shot Weakly Supervised Video Object Segmentation

Authors: Mennatullah Siam, Naren Doraiswamy, Boris N. Oreshkin, Hengshuai Yao, Martin Jagersand

Abstract: Conventional few-shot object segmentation methods learn object segmentation from a few labelled support images with strongly labelled segmentation masks. Recent work has shown to perform on par with weaker levels of supervision in terms of scribbles and bounding boxes. However, there has been limited attention given to the problem of few-shot object segmentation with image-level supervision. We pr… ▽ More Conventional few-shot object segmentation methods learn object segmentation from a few labelled support images with strongly labelled segmentation masks. Recent work has shown to perform on par with weaker levels of supervision in terms of scribbles and bounding boxes. However, there has been limited attention given to the problem of few-shot object segmentation with image-level supervision. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism using both visual and word embeddings. It enables our model to achieve 5.1% improvement over previously proposed image-level few-shot object segmentation. Our method compares relatively close to the state of the art methods that use strong supervision, while ours use the least possible supervision. We further propose a novel setup for few-shot weakly supervised video object segmentation(VOS) that relies on image-level labels for the first frame. The proposed setup uses weak annotation unlike semi-supervised VOS setting that utilizes strongly labelled segmentation masks. The setup evaluates the effectiveness of generalizing to novel classes in the VOS setting. The setup splits the VOS data into multiple folds with different categories per fold. It provides a potential setup to evaluate how few-shot object segmentation methods can benefit from additional object poses, or object interactions that is not available in static frames as in PASCAL-5i benchmark. △ Less

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1911.04418 [pdf, other]

doi 10.1109/ICRA40945.2020.9196570

Visual Geometric Skill Inference by Watching Human Demonstration

Authors: Jun **, Laura Petrich, Zichen Zhang, Masood Dehghan, Martin Jagersand

Abstract: We study the problem of learning manipulation skills from human demonstration video by inferring the association relationships between geometric features. Motivation for this work stems from the observation that humans perform eye-hand coordination tasks by using geometric primitives to define a task while a geometric control error drives the task through execution. We propose a graph based kernel… ▽ More We study the problem of learning manipulation skills from human demonstration video by inferring the association relationships between geometric features. Motivation for this work stems from the observation that humans perform eye-hand coordination tasks by using geometric primitives to define a task while a geometric control error drives the task through execution. We propose a graph based kernel regression method to directly infer the underlying association constraints from human demonstration video using Incremental Maximum Entropy Inverse Reinforcement Learning (InMaxEnt IRL). The learned skill inference provides human readable task definition and outputs control errors that can be directly plugged into traditional controllers. Our method removes the need for tedious feature selection and robust feature trackers required in traditional approaches (e.g. feature-based visual servoing). Experiments show our method infers correct geometric associations even with only one human demonstration video and can generalize well under variance. △ Less

Submitted 5 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Accepted in ICRA 2020

arXiv:1911.03074 [pdf, other]

doi 10.1109/ICRA40945.2020.9197148

Mapless Navigation among Dynamics with Social-safety-awareness: a reinforcement learning approach from 2D laser scans

Authors: Jun **, Nhat M. Nguyen, Nazmus Sakib, Daniel Graves, Hengshuai Yao, Martin Jagersand

Abstract: We propose a method to tackle the problem of mapless collision-avoidance navigation where humans are present using 2D laser scans. Our proposed method uses ego-safety to measure collision from the robot's perspective while social-safety to measure the impact of our robot's actions on surrounding pedestrians. Specifically, the social-safety part predicts the intrusion impact of our robot's action i… ▽ More We propose a method to tackle the problem of mapless collision-avoidance navigation where humans are present using 2D laser scans. Our proposed method uses ego-safety to measure collision from the robot's perspective while social-safety to measure the impact of our robot's actions on surrounding pedestrians. Specifically, the social-safety part predicts the intrusion impact of our robot's action into the interaction area with surrounding humans. We train the policy using reinforcement learning on a simple simulator and directly evaluate the learned policy in Gazebo and real robot tests. Experiments show the learned policy can be smoothly transferred without any fine tuning. We observe that our method demonstrates time-efficient path planning behavior with high success rate in mapless navigation tasks. Furthermore, we test our method in a navigation among dynamic crowds task considering both low and high volume traffic. Our learned policy demonstrates cooperative behavior that actively drives our robot into traffic flows while showing respect to nearby pedestrians. Evaluation videos are at https://sites.google.com/view/ssw-batman △ Less

Submitted 5 March, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: Accepted in ICRA 2020

arXiv:1909.07459 [pdf, other]

Bridging Visual Perception with Contextual Semantics for Understanding Robot Manipulation Tasks

Authors: Chen Jiang, Martin Jagersand

Abstract: Understanding manipulation scenarios allows intelligent robots to plan for appropriate actions to complete a manipulation task successfully. It is essential for intelligent robots to semantically interpret manipulation knowledge by describing entities, relations and attributes in a structural manner. In this paper, we propose an implementing framework to generate high-level conceptual dynamic know… ▽ More Understanding manipulation scenarios allows intelligent robots to plan for appropriate actions to complete a manipulation task successfully. It is essential for intelligent robots to semantically interpret manipulation knowledge by describing entities, relations and attributes in a structural manner. In this paper, we propose an implementing framework to generate high-level conceptual dynamic knowledge graphs from video clips. A combination of a Vision-Language model and an ontology system, in correspondence with visual perception and contextual semantics, is used to represent robot manipulation knowledge with Entity-Relation-Entity (E-R-E) and Entity-Attribute-Value (E-A-V) tuples. The proposed method is flexible and well-versed. Using the framework, we present a case study where robot performs manipulation actions in a kitchen environment, bridging visual perception with contextual semantics using the generated dynamic knowledge graphs. △ Less

Submitted 26 July, 2020; v1 submitted 16 September, 2019; originally announced September 2019.

arXiv:1903.09189 [pdf, other]

Long range teleoperation for fine manipulation tasks under time-delay network conditions

Authors: Jun **, Laura Petrich, Shida He, Masood Dehghan, Martin Jagersand

Abstract: We present a coarse-to-fine approach based semi-autonomous teleoperation system using vision guidance. The system is optimized for long range teleoperation tasks under time-delay network conditions and does not require prior knowledge of the remote scene. Our system initializes with a self exploration behavior that senses the remote surroundings through a freely mounted eye-in-hand web cam. The se… ▽ More We present a coarse-to-fine approach based semi-autonomous teleoperation system using vision guidance. The system is optimized for long range teleoperation tasks under time-delay network conditions and does not require prior knowledge of the remote scene. Our system initializes with a self exploration behavior that senses the remote surroundings through a freely mounted eye-in-hand web cam. The self exploration stage estimates hand-eye calibration and provides a telepresence interface via real-time 3D geometric reconstruction. The human operator is able to specify a visual task through the interface and a coarse-to-fine controller guides the remote robot enabling our system to work in high latency networks. Large motions are guided by coarse 3D estimation, whereas fine motions use image cues (IBVS). Network data transmission cost is minimized by sending only sparse points and a final image to the human side. Experiments from Singapore to Canada on multiple tasks were conducted to show our system's capability to work in long range teleoperation tasks. △ Less

Submitted 21 March, 2019; originally announced March 2019.

Comments: --submitted to IROS 2019 with RA-L option

arXiv:1903.00634 [pdf, other]

Evaluation of state representation methods in robot hand-eye coordination learning from demonstration

Authors: Jun **, Masood Dehghan, Laura Petrich, Steven Weikai Lu, Martin Jagersand

Abstract: We evaluate different state representation methods in robot hand-eye coordination learning on different aspects. Regarding state dimension reduction: we evaluates how these state representation methods capture relevant task information and how much compactness should a state representation be. Regarding controllability: experiments are designed to use different state representation methods in a tr… ▽ More We evaluate different state representation methods in robot hand-eye coordination learning on different aspects. Regarding state dimension reduction: we evaluates how these state representation methods capture relevant task information and how much compactness should a state representation be. Regarding controllability: experiments are designed to use different state representation methods in a traditional visual servoing controller and a REINFORCE controller. We analyze the challenges arisen from the representation itself other than from control algorithms. Regarding embodiment problem in LfD: we evaluate different method's capability in transferring learned representation from human to robot. Results are visualized for better understanding and comparison. △ Less

Submitted 2 March, 2019; originally announced March 2019.

Comments: submitted to IROS 2019

arXiv:1902.11123 [pdf, other]

Adaptive Masked Proxies for Few-Shot Segmentation

Authors: Mennatullah Siam, Boris Oreshkin, Martin Jagersand

Abstract: Deep learning has thrived by training on large-scale datasets. However, in robotics applications sample efficiency is critical. We propose a novel adaptive masked proxies method that constructs the final segmentation layer weights from few labelled samples. It utilizes multi-resolution average pooling on base embeddings masked with the label to act as a positive proxy for the new class, while fusi… ▽ More Deep learning has thrived by training on large-scale datasets. However, in robotics applications sample efficiency is critical. We propose a novel adaptive masked proxies method that constructs the final segmentation layer weights from few labelled samples. It utilizes multi-resolution average pooling on base embeddings masked with the label to act as a positive proxy for the new class, while fusing it with the previously learned class signatures. Our method is evaluated on PASCAL-$5^i$ dataset and outperforms the state-of-the-art in the few-shot semantic segmentation. Unlike previous methods, our approach does not require a second branch to estimate parameters or prototypes, which enables it to be used with 2-stream motion and appearance based segmentation networks. We further propose a novel setup for evaluating continual learning of object segmentation which we name incremental PASCAL (iPASCAL) where our method outperforms the baseline method. Our code is publicly available at https://github.com/MSiam/AdaptiveMaskedProxies. △ Less

Submitted 14 October, 2019; v1 submitted 19 February, 2019; originally announced February 2019.

Comments: Accepted to ICCV'19

arXiv:1810.07733 [pdf, other]

Video Object Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting

Authors: Mennatullah Siam, Chen Jiang, Steven Lu, Laura Petrich, Mahmoud Gamal, Mohamed Elhoseiny, Martin Jagersand

Abstract: Video object segmentation is an essential task in robot manipulation to facilitate gras** and learning affordances. Incremental learning is important for robotics in unstructured environments, since the total number of objects and their variations can be intractable. Inspired by the children learning process, human robot interaction (HRI) can be utilized to teach robots about the world guided by… ▽ More Video object segmentation is an essential task in robot manipulation to facilitate gras** and learning affordances. Incremental learning is important for robotics in unstructured environments, since the total number of objects and their variations can be intractable. Inspired by the children learning process, human robot interaction (HRI) can be utilized to teach robots about the world guided by humans similar to how children learn from a parent or a teacher. A human teacher can show potential objects of interest to the robot, which is able to self adapt to the teaching signal without providing manual segmentation labels. We propose a novel teacher-student learning paradigm to teach robots about their surrounding environment. A two-stream motion and appearance "teacher" network provides pseudo-labels to adapt an appearance "student" network. The student network is able to segment the newly learned objects in other scenes, whether they are static or in motion. We also introduce a carefully designed dataset that serves the proposed HRI setup, denoted as (I)nteractive (V)ideo (O)bject (S)egmentation. Our IVOS dataset contains teaching videos of different objects, and manipulation tasks. Unlike previous datasets, IVOS provides manipulation tasks sequences with segmentation annotation along with the waypoints for the robot trajectories. It also provides segmentation annotation for the different transformations such as translation, scale, planar rotation, and out-of-plane rotation. Our proposed adaptation method outperforms the state-of-the-art on DAVIS and FBMS with 6.8% and 1.2% in F-measure respectively. It improves over the baseline on IVOS dataset with 46.1% and 25.9% in mIoU. △ Less

Submitted 12 March, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: Accepted in ICRA'19, https://msiam.github.io/ivos/

arXiv:1810.00159 [pdf, other]

doi 10.1109/ICRA.2019.8793649

Robot eye-hand coordination learning by watching human demonstrations: a task function approximation approach

Authors: Jun **, Laura Petrich, Masood Dehghan, Zichen Zhang, Martin Jagersand

Abstract: We present a robot eye-hand coordination learning method that can directly learn visual task specification by watching human demonstrations. Task specification is represented as a task function, which is learned using inverse reinforcement learning(IRL) by inferring differential rewards between state changes. The learned task function is then used as continuous feedbacks in an uncalibrated visual… ▽ More We present a robot eye-hand coordination learning method that can directly learn visual task specification by watching human demonstrations. Task specification is represented as a task function, which is learned using inverse reinforcement learning(IRL) by inferring differential rewards between state changes. The learned task function is then used as continuous feedbacks in an uncalibrated visual servoing(UVS) controller designed for the execution phase. Our proposed method can directly learn from raw videos, which removes the need for hand-engineered task specification. It can also provide task interpretability by directly approximating the task function. Besides, benefiting from the use of a traditional UVS controller, our training process is efficient and the learned policy is independent from a particular robot platform. Various experiments were designed to show that, for a certain DOF task, our method can adapt to task/environment variances in target positions, backgrounds, illuminations, and occlusions without prior retraining. △ Less

Submitted 27 February, 2019; v1 submitted 29 September, 2018; originally announced October 2018.

Comments: Accepted in ICRA 2019

arXiv:1809.08722 [pdf, other]

Online Object and Task Learning via Human Robot Interaction

Authors: Masood Dehghan, Zichen Zhang, Mennatullah Siam, Jun **, Laura Petrich, Martin Jagersand

Abstract: This work describes the development of a robotic system that acquires knowledge incrementally through human interaction where new tools and motions are taught on the fly. The robotic system developed was one of the five finalists in the KUKA Innovation Award competition and demonstrated during the Hanover Messe 2018 in Germany. The main contributions of the system are a) a novel incremental object… ▽ More This work describes the development of a robotic system that acquires knowledge incrementally through human interaction where new tools and motions are taught on the fly. The robotic system developed was one of the five finalists in the KUKA Innovation Award competition and demonstrated during the Hanover Messe 2018 in Germany. The main contributions of the system are a) a novel incremental object learning module - a deep learning based localization and recognition system - that allows a human to teach new objects to the robot, b) an intuitive user interface for specifying 3D motion task associated with the new object, c) a hybrid force-vision control module for performing compliant motion on an unstructured surface. This paper describes the implementation and integration of the main modules of the system and summarizes the lessons learned from the competition. △ Less

Submitted 27 February, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

Comments: 7 pages. ICRA19

arXiv:1803.02758 [pdf, other]

RTSeg: Real-time Semantic Segmentation Comparative Study

Authors: Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, Martin Jagersand

Abstract: Semantic segmentation benefits robotics related applications especially autonomous driving. Most of the research on semantic segmentation is only on increasing the accuracy of segmentation models with little attention to computationally efficient solutions. The few work conducted in this direction does not provide principled methods to evaluate the different design choices for segmentation. In thi… ▽ More Semantic segmentation benefits robotics related applications especially autonomous driving. Most of the research on semantic segmentation is only on increasing the accuracy of segmentation models with little attention to computationally efficient solutions. The few work conducted in this direction does not provide principled methods to evaluate the different design choices for segmentation. In this paper, we address this gap by presenting a real-time semantic segmentation benchmarking framework with a decoupled design for feature extraction and decoding methods. The framework is comprised of different network architectures for feature extraction such as VGG16, Resnet18, MobileNet, and ShuffleNet. It is also comprised of multiple meta-architectures for segmentation that define the decoding methodology. These include SkipNet, UNet, and Dilation Frontend. Experimental results are presented on the Cityscapes dataset for urban scenes. The modular design allows novel architectures to emerge, that lead to 143x GFLOPs reduction in comparison to SegNet. This benchmarking framework is publicly available at "https://github.com/MSiam/TFSegmentation". △ Less

Submitted 16 May, 2020; v1 submitted 7 March, 2018; originally announced March 2018.

Comments: Accepted in IEEE ICIP 2018. IEEE Copyrights: Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

arXiv:1801.02722 [pdf, other]

End-to-end detection-segmentation network with ROI convolution

Authors: Zichen Zhang, Min Tang, Dana Cobzas, Dornoosh Zonoobi, Martin Jagersand, Jacob L. Jaremko

Abstract: We propose an end-to-end neural network that improves the segmentation accuracy of fully convolutional networks by incorporating a localization unit. This network performs object localization first, which is then used as a cue to guide the training of the segmentation network. We test the proposed method on a segmentation task of small objects on a clinical dataset of ultrasound images. We show th… ▽ More We propose an end-to-end neural network that improves the segmentation accuracy of fully convolutional networks by incorporating a localization unit. This network performs object localization first, which is then used as a cue to guide the training of the segmentation network. We test the proposed method on a segmentation task of small objects on a clinical dataset of ultrasound images. We show that by jointly learning for detection and segmentation, the proposed network is able to improve the segmentation accuracy compared to only learning for segmentation. Code is publicly available at https://github.com/vincentzhang/roi-fcn. △ Less

Submitted 2 December, 2019; v1 submitted 8 January, 2018; originally announced January 2018.

Comments: ISBI 2018

arXiv:1711.00139 [pdf, other]

Segmentation-by-Detection: A Cascade Network for Volumetric Medical Image Segmentation

Authors: Min Tang, Zichen Zhang, Dana Cobzas, Martin Jagersand, Jacob L. Jaremko

Abstract: We propose an attention mechanism for 3D medical image segmentation. The method, named segmentation-by-detection, is a cascade of a detection module followed by a segmentation module. The detection module enables a region of interest to come to attention and produces a set of object region candidates which are further used as an attention model. Rather than dealing with the entire volume, the segm… ▽ More We propose an attention mechanism for 3D medical image segmentation. The method, named segmentation-by-detection, is a cascade of a detection module followed by a segmentation module. The detection module enables a region of interest to come to attention and produces a set of object region candidates which are further used as an attention model. Rather than dealing with the entire volume, the segmentation module distills the information from the potential region. This scheme is an efficient solution for volumetric data as it reduces the influence of the surrounding noise which is especially important for medical data with low signal-to-noise ratio. Experimental results on 3D ultrasound data of the femoral head shows superiority of the proposed method when compared with a standard fully convolutional network like the U-Net. △ Less

Submitted 31 October, 2017; originally announced November 2017.

arXiv:1709.04821 [pdf, other]

MODNet: Moving Object Detection Network with Motion and Appearance for Autonomous Driving

Authors: Mennatullah Siam, Heba Mahgoub, Mohamed Zahran, Senthil Yogamani, Martin Jagersand, Ahmad El-Sallab

Abstract: We propose a novel multi-task learning system that combines appearance and motion cues for a better semantic reasoning of the environment. A unified architecture for joint vehicle detection and motion segmentation is introduced. In this architecture, a two-stream encoder is shared among both tasks. In order to evaluate our method in autonomous driving setting, KITTI annotated sequences with detect… ▽ More We propose a novel multi-task learning system that combines appearance and motion cues for a better semantic reasoning of the environment. A unified architecture for joint vehicle detection and motion segmentation is introduced. In this architecture, a two-stream encoder is shared among both tasks. In order to evaluate our method in autonomous driving setting, KITTI annotated sequences with detection and odometry ground truth are used to automatically generate static/dynamic annotations on the vehicles. This dataset is called KITTI Moving Object Detection dataset (KITTI MOD). The dataset will be made publicly available to act as a benchmark for the motion detection task. Our experiments show that the proposed method outperforms state of the art methods that utilize motion cue only with 21.5% in mAP on KITTI MOD. Our method performs on par with the state of the art unsupervised methods on DAVIS benchmark for generic object segmentation. One of our interesting conclusions is that joint training of motion segmentation and vehicle detection benefits motion segmentation. Motion segmentation has relatively fewer data, unlike the detection task. However, the shared fusion encoder benefits from joint training to learn a generalized representation. The proposed method runs in 120 ms per frame, which beats the state of the art motion detection/segmentation in computational efficiency. △ Less

Submitted 12 November, 2017; v1 submitted 14 September, 2017; originally announced September 2017.

arXiv:1708.03275 [pdf, other]

Incremental 3D Line Segment Extraction from Semi-dense SLAM

Authors: Shida He, Xuebin Qin, Zichen Zhang, Martin Jagersand

Abstract: Although semi-dense Simultaneous Localization and Map** (SLAM) has been becoming more popular over the last few years, there is a lack of efficient methods for representing and processing their large scale point clouds. In this paper, we propose using 3D line segments to simplify the point clouds generated by semi-dense SLAM. Specifically, we present a novel incremental approach for 3D line segm… ▽ More Although semi-dense Simultaneous Localization and Map** (SLAM) has been becoming more popular over the last few years, there is a lack of efficient methods for representing and processing their large scale point clouds. In this paper, we propose using 3D line segments to simplify the point clouds generated by semi-dense SLAM. Specifically, we present a novel incremental approach for 3D line segment extraction. This approach reduces a 3D line segment fitting problem into two 2D line segment fitting problems and takes advantage of both images and depth maps. In our method, 3D line segments are fitted incrementally along detected edge segments via minimizing fitting errors on two planes. By clustering the detected line segments, the resulting 3D representation of the scene achieves a good balance between compactness and completeness. Our experimental results show that the 3D line segments generated by our method are highly accurate. As an application, we demonstrate that these line segments greatly improve the quality of 3D surface reconstruction compared to a feature point based baseline. △ Less

Submitted 26 April, 2018; v1 submitted 10 August, 2017; originally announced August 2017.

Comments: Accepted at ICPR 2018

arXiv:1707.02432 [pdf, other]

Deep Semantic Segmentation for Automated Driving: Taxonomy, Roadmap and Challenges

Authors: Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, Senthil Yogamani

Abstract: Semantic segmentation was seen as a challenging computer vision problem few years ago. Due to recent advancements in deep learning, relatively accurate solutions are now possible for its use in automated driving. In this paper, the semantic segmentation problem is explored from the perspective of automated driving. Most of the current semantic segmentation algorithms are designed for generic image… ▽ More Semantic segmentation was seen as a challenging computer vision problem few years ago. Due to recent advancements in deep learning, relatively accurate solutions are now possible for its use in automated driving. In this paper, the semantic segmentation problem is explored from the perspective of automated driving. Most of the current semantic segmentation algorithms are designed for generic images and do not incorporate prior structure and end goal for automated driving. First, the paper begins with a generic taxonomic survey of semantic segmentation algorithms and then discusses how it fits in the context of automated driving. Second, the particular challenges of deploying it into a safety system which needs high level of accuracy and robustness are listed. Third, different alternatives instead of using an independent semantic segmentation module are explored. Finally, an empirical evaluation of various semantic segmentation architectures was performed on CamVid dataset in terms of accuracy and speed. This paper is a preliminary shorter version of a more detailed survey which is work in progress. △ Less

Submitted 3 August, 2017; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: To appear in IEEE ITSC 2017

arXiv:1705.00360 [pdf, other]

Real-Time Salient Closed Boundary Tracking via Line Segments Perceptual Grou**

Authors: Xuebin Qin, Shida He, Camilo Perez Quintero, Abhineet Singh, Masood Dehghan, Martin Jagersand

Abstract: This paper presents a novel real-time method for tracking salient closed boundaries from video image sequences. This method operates on a set of straight line segments that are produced by line detection. The tracking scheme is coherently integrated into a perceptual grou** framework in which the visual tracking problem is tackled by identifying a subset of these line segments and connecting the… ▽ More This paper presents a novel real-time method for tracking salient closed boundaries from video image sequences. This method operates on a set of straight line segments that are produced by line detection. The tracking scheme is coherently integrated into a perceptual grou** framework in which the visual tracking problem is tackled by identifying a subset of these line segments and connecting them sequentially to form a closed boundary with the largest saliency and a certain similarity to the previous one. Specifically, we define a new tracking criterion which combines a grou** cost and an area similarity constraint. The proposed criterion makes the resulting boundary tracking more robust to local minima. To achieve real-time tracking performance, we use Delaunay Triangulation to build a graph model with the detected line segments and then reduce the tracking problem to finding the optimal cycle in this graph. This is solved by our newly proposed closed boundary candidates searching algorithm called "Bidirectional Shortest Path (BDSP)". The efficiency and robustness of the proposed method are tested on real video sequences as well as during a robot arm pouring experiment. △ Less

Submitted 9 August, 2017; v1 submitted 30 April, 2017; originally announced May 2017.

Comments: 7 pages, 8 figures, The 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017) submission ID 1034

arXiv:1703.01698 [pdf, other]

4-DoF Tracking for Robot Fine Manipulation Tasks

Authors: Mennatullah Siam, Abhineet Singh, Camilo Perez, Martin Jagersand

Abstract: This paper presents two visual trackers from the different paradigms of learning and registration based tracking and evaluates their application in image based visual servoing. They can track object motion with four degrees of freedom (DoF) which, as we will show here, is sufficient for many fine manipulation tasks. One of these trackers is a newly developed learning based tracker that relies on l… ▽ More This paper presents two visual trackers from the different paradigms of learning and registration based tracking and evaluates their application in image based visual servoing. They can track object motion with four degrees of freedom (DoF) which, as we will show here, is sufficient for many fine manipulation tasks. One of these trackers is a newly developed learning based tracker that relies on learning discriminative correlation filters while the other is a refinement of a recent 8 DoF RANSAC based tracker adapted with a new appearance model for tracking 4 DoF motion. Both trackers are shown to provide superior performance to several state of the art trackers on an existing dataset for manipulation tasks. Further, a new dataset with challenging sequences for fine manipulation tasks captured from robot mounted eye-in-hand (EIH) cameras is also presented. These sequences have a variety of challenges encountered during real tasks including jittery camera movement, motion blur, drastic scale changes and partial occlusions. Quantitative and qualitative results on these sequences are used to show that these two trackers are robust to failures while providing high precision that makes them suitable for such fine manipulation tasks. △ Less

Submitted 3 April, 2017; v1 submitted 5 March, 2017; originally announced March 2017.

Comments: accepted in CRV 2017

arXiv:1701.04693 [pdf, other]

Incremental Learning for Robot Perception through HRI

Authors: Sepehr Valipour, Camilo Perez, Martin Jagersand

Abstract: Scene understanding and object recognition is a difficult to achieve yet crucial skill for robots. Recently, Convolutional Neural Networks (CNN), have shown success in this task. However, there is still a gap between their performance on image datasets and real-world robotics scenarios. We present a novel paradigm for incrementally improving a robot's visual perception through active human interac… ▽ More Scene understanding and object recognition is a difficult to achieve yet crucial skill for robots. Recently, Convolutional Neural Networks (CNN), have shown success in this task. However, there is still a gap between their performance on image datasets and real-world robotics scenarios. We present a novel paradigm for incrementally improving a robot's visual perception through active human interaction. In this paradigm, the user introduces novel objects to the robot by means of pointing and voice commands. Given this information, the robot visually explores the object and adds images from it to re-train the perception module. Our base perception module is based on recent development in object detection and recognition using deep learning. Our method leverages state of the art CNNs from off-line batch learning, human guidance, robot exploration and incremental on-line learning. △ Less

Submitted 17 January, 2017; originally announced January 2017.

arXiv:1611.05435 [pdf, other]

Convolutional Gated Recurrent Networks for Video Segmentation

Authors: Mennatullah Siam, Sepehr Valipour, Martin Jagersand, Nilanjan Ray

Abstract: Semantic segmentation has recently witnessed major progress, where fully convolutional neural networks have shown to perform well. However, most of the previous work focused on improving single image segmentation. To our knowledge, no prior work has made use of temporal video information in a recurrent network. In this paper, we introduce a novel approach to implicitly utilize temporal data in vid… ▽ More Semantic segmentation has recently witnessed major progress, where fully convolutional neural networks have shown to perform well. However, most of the previous work focused on improving single image segmentation. To our knowledge, no prior work has made use of temporal video information in a recurrent network. In this paper, we introduce a novel approach to implicitly utilize temporal data in videos for online semantic segmentation. The method relies on a fully convolutional network that is embedded into a gated recurrent architecture. This design receives a sequence of consecutive video frames and outputs the segmentation of the last frame. Convolutional gated recurrent networks are used for the recurrent part to preserve spatial connectivities in the image. Our proposed method can be applied in both online and batch segmentation. This architecture is tested for both binary and semantic video segmentation tasks. Experiments are conducted on the recent benchmarks in SegTrack V2, Davis, CityScapes, and Synthia. Using recurrent fully convolutional networks improved the baseline network performance in all of our experiments. Namely, 5% and 3% improvement of F-measure in SegTrack2 and Davis respectively, 5.7% improvement in mean IoU in Synthia and 3.5% improvement in categorical mean IoU in CityScapes. The performance of the RFCN network depends on its baseline fully convolutional network. Thus RFCN architecture can be seen as a method to improve its baseline segmentation network by exploiting spatiotemporal information in videos. △ Less

Submitted 21 November, 2016; v1 submitted 16 November, 2016; originally announced November 2016.

Comments: arXiv admin note: text overlap with arXiv:1606.00487

arXiv:1607.04673 [pdf, ps, other]

Unifying Registration based Tracking: A Case Study with Structural Similarity

Authors: Abhineet Singh, Mennatullah Siam, Martin Jagersand

Abstract: This paper adapts a popular image quality measure called structural similarity for high precision registration based tracking while also introducing a simpler and faster variant of the same. Further, these are evaluated comprehensively against existing measures using a unified approach to study registration based trackers that decomposes them into three constituent sub modules - appearance model,… ▽ More This paper adapts a popular image quality measure called structural similarity for high precision registration based tracking while also introducing a simpler and faster variant of the same. Further, these are evaluated comprehensively against existing measures using a unified approach to study registration based trackers that decomposes them into three constituent sub modules - appearance model, state space model and search method. Several popular trackers in literature are broken down using this method so that their contributions - as of this paper - are shown to be limited to only one or two of these submodules. An open source tracking framework is made available that follows this decomposition closely through extensive use of generic programming. It is used to perform all experiments on four publicly available datasets so the results are easily reproducible. This framework provides a convenient interface to plug in a new method for any sub module and combine it with existing methods for the other two. It can also serve as a fast and flexible solution for practical tracking needs due to its highly efficient implementation. △ Less

Submitted 30 January, 2017; v1 submitted 15 July, 2016; originally announced July 2016.

Comments: Accepted at WACV 2017. Supplementary available at: http://webdocs.cs.ualberta.ca/~vis/mtf/ssim_supplementary.pdf arXiv admin note: text overlap with arXiv:1603.01292

arXiv:1606.09367 [pdf, other]

Parking Stall Vacancy Indicator System Based on Deep Convolutional Neural Networks

Authors: Sepehr Valipour, Mennatullah Siam, Eleni Stroulia, Martin Jagersand

Abstract: Parking management systems, and vacancy-indication services in particular, can play a valuable role in reducing traffic and energy waste in large cities. Visual detection methods represent a cost-effective option, since they can take advantage of hardware usually already available in many parking lots, namely cameras. However, visual detection methods can be fragile and not easily generalizable. I… ▽ More Parking management systems, and vacancy-indication services in particular, can play a valuable role in reducing traffic and energy waste in large cities. Visual detection methods represent a cost-effective option, since they can take advantage of hardware usually already available in many parking lots, namely cameras. However, visual detection methods can be fragile and not easily generalizable. In this paper, we present a robust detection algorithm based on deep convolutional neural networks. We implemented and tested our algorithm on a large baseline dataset, and also on a set of image feeds from actual cameras already installed in parking lots. We have developed a fully functional system, from server-side image analysis to front-end user interface, to demonstrate the practicality of our method. △ Less

Submitted 30 June, 2016; originally announced June 2016.

arXiv:1606.00487 [pdf, other]

Recurrent Fully Convolutional Networks for Video Segmentation

Authors: Sepehr Valipour, Mennatullah Siam, Martin Jagersand, Nilanjan Ray

Abstract: Image segmentation is an important step in most visual tasks. While convolutional neural networks have shown to perform well on single image segmentation, to our knowledge, no study has been been done on leveraging recurrent gated architectures for video segmentation. Accordingly, we propose a novel method for online segmentation of video sequences that incorporates temporal data. The network is b… ▽ More Image segmentation is an important step in most visual tasks. While convolutional neural networks have shown to perform well on single image segmentation, to our knowledge, no study has been been done on leveraging recurrent gated architectures for video segmentation. Accordingly, we propose a novel method for online segmentation of video sequences that incorporates temporal data. The network is built from fully convolutional element and recurrent unit that works on a sliding window over the temporal data. We also introduce a novel convolutional gated recurrent unit that preserves the spatial information and reduces the parameters learned. Our method has the advantage that it can work in an online fashion instead of operating over the whole input batch of video frames. The network is tested on the change detection dataset, and proved to have 5.5\% improvement in F-measure over a plain fully convolutional network for per frame segmentation. It was also shown to have improvement of 1.4\% for the F-measure compared to our baseline network that we call FCN 12s. △ Less

Submitted 30 October, 2016; v1 submitted 1 June, 2016; originally announced June 2016.

arXiv:1603.01292 [pdf, ps, other]

Modular Decomposition and Analysis of Registration based Trackers

Authors: Abhineet Singh, Ankush Roy, Xi Zhang, Martin Jagersand

Abstract: This paper presents a new way to study registration based trackers by decomposing them into three constituent sub modules: appearance model, state space model and search method. It is often the case that when a new tracker is introduced in literature, it only contributes to one or two of these sub modules while using existing methods for the rest. Since these are often selected arbitrarily by the… ▽ More This paper presents a new way to study registration based trackers by decomposing them into three constituent sub modules: appearance model, state space model and search method. It is often the case that when a new tracker is introduced in literature, it only contributes to one or two of these sub modules while using existing methods for the rest. Since these are often selected arbitrarily by the authors, they may not be optimal for the new method. In such cases, our breakdown can help to experimentally find the best combination of methods for these sub modules while also providing a framework within which the contributions of the new tracker can be clearly demarcated and thus studied better. We show how existing trackers can be broken down using the suggested methodology and compare the performance of the default configuration chosen by the authors against other possible combinations to demonstrate the new insights that can be gained by such an approach. We also present an open source system that provides a convenient interface to plug in a new method for any sub module and test it against all possible combinations of methods for the other two sub modules while also serving as a fast and efficient solution for practical tracking requirements. △ Less

Submitted 25 March, 2016; v1 submitted 3 March, 2016; originally announced March 2016.

arXiv:1602.09130 [pdf, other]

Modular Tracking Framework: A Unified Approach to Registration based Tracking

Authors: Abhineet Singh, Martin Jagersand

Abstract: This paper presents a modular, extensible and highly efficient open source framework for registration based tracking called Modular Tracking Framework (MTF). Targeted at robotics applications, it is implemented entirely in C++ and designed from the ground up to easily integrate with systems that support any of several major vision and robotics libraries including OpenCV, ROS, ViSP and Eigen. It im… ▽ More This paper presents a modular, extensible and highly efficient open source framework for registration based tracking called Modular Tracking Framework (MTF). Targeted at robotics applications, it is implemented entirely in C++ and designed from the ground up to easily integrate with systems that support any of several major vision and robotics libraries including OpenCV, ROS, ViSP and Eigen. It implements more methods, is faster, and more precise than other existing systems. Further, the theoretical basis for its design is a new way to conceptualize registration based trackers that decomposes them into three constituent sub modules - Search Method (SM), Appearance Model (AM) and State Space Model (SSM). In the process, we integrate many important advances published after Baker \& Matthews' landmark work in 2004. In addition to being a practical solution for fast and high precision tracking, MTF can also serve as a useful research tool by allowing existing and new methods for any of the sub modules to be studied better. When a new method is introduced for one of these, the breakdown can help to experimentally find the combination of methods for the others that is optimum for it. By extensive use of generic programming, MTF makes it easy to plug in a new method for any of the sub modules so that it can not only be tested comprehensively with existing methods but also become immediately available for deployment in any project that uses the framework. With 16 AMs, 11 SMs and 13 SSMs implemented already, MTF provides over 2000 distinct single layer trackers. It also allows two or more of these to be combined together in several ways to create a practically unlimited variety of novel multi layer trackers. △ Less

Submitted 17 May, 2018; v1 submitted 29 February, 2016; originally announced February 2016.

Comments: Under consideration at Computer Vision and Image Understanding

Showing 1–43 of 43 results for author: Jagersand, M