Neural L1 Adaptive Control of Vehicle Lateral Dynamics

Authors: Pratik Mukherjee, Burak M. Gonultas, O. Goktug Poyrazoglu, Volkan Isler

Abstract: We address the problem of stable and robust control of vehicles with lateral error dynamics for the application of lane keeping. Lane departure is the primary reason for half of the fatalities in road accidents, making the development of stable, adaptive and robust controllers a necessity. Traditional linear feedback controllers achieve satisfactory tracking performance, however, they exhibit unstable behavior when uncertainties are induced into the system. Any disturbance or uncertainty introduced to the steering-angle input can be catastrophic for the vehicle. Therefore, controllers must be developed to actively handle such uncertainties. In this work, we introduce a Neural L1 Adaptive controller (Neural-L1) which learns the uncertainties in the lateral error dynamics of a front-steered Ackermann vehicle and guarantees stability and robustness. Our contributions are threefold: i) We extend the theoretical results for guaranteed stability and robustness of conventional L1 Adaptive controllers to Neural-L1; ii) We implement a Neural-L1 for the lane keeping application which learns uncertainties in the dynamics accurately; iii) We evaluate the performance of Neural-L1 on a physics-based simulator, PyBullet, and conduct extensive real-world experiments with the F1TENTH platform to demonstrate superior reference trajectory tracking performance of Neural-L1 compared to other state-of-the-art controllers, in the presence of uncertainties. Our project page, including supplementary material and videos, can be found at https://mukhe027.github.io/Neural-Adaptive-Control/

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.05372 [pdf, other]

Learning to Play Pursuit-Evasion with Dynamic and Sensor Constraints

Authors: Burak M. Gonultas, Volkan Isler

Abstract: We present a multi-agent reinforcement learning approach to solve a pursuit-evasion game between two players with car-like dynamics and sensing limitations. We develop a curriculum for an existing multi-agent deterministic policy gradient algorithm to simultaneously obtain strategies for both players, and deploy the learned strategies on real robots moving as fast as 2 m/s in indoor environments. Through experiments we show that the learned strategies improve over existing baselines by up to 30% in terms of capture rate for the pursuer. The learned evader model has up to 5% better escape rate over the baselines even against our competitive pursuer model. We also present experimental results which show how the pursuit-evasion game and its results evolve as the player dynamics and sensor constraints are varied. Finally, we deploy learned policies on physical robots for a game between the F1TENTH and JetRacer platforms and show that the learned strategies can be executed on real robots. Our code and supplementary material including videos from experiments are available at https://gonultasbu.github.io/pursuit-evasion/.

Submitted 8 May, 2024; originally announced May 2024.

arXiv:2403.13294 [pdf, other]

Map-Aware Human Pose Prediction for Robot Follow-Ahead

Authors: Qingyuan Jiang, Burak Susam, Jun-Jee Chao, Volkan Isler

Abstract: In the robot follow-ahead task, a mobile robot is tasked to maintain its relative position in front of a moving human actor while keeping the actor in sight. To accomplish this task, it is important that the robot understand the full 3D pose of the human (since the head orientation can be different than the torso) and predict future human poses so as to plan accordingly. This prediction task is especially tricky in a complex environment with junctions and multiple corridors. In this work, we address the problem of forecasting the full 3D trajectory of a human in such environments. Our main insight is to show that one can first predict the 2D trajectory and then estimate the full 3D trajectory by conditioning the estimator on the predicted 2D trajectory. With this approach, we achieve results comparable or better than the state-of-the-art methods three times faster. As part of our contribution, we present a new dataset where, in contrast to existing datasets, the human motion is in a much larger area than a single room. We also present a complete robot system that integrates our human pose forecasting network on the mobile robot to enable real-time robot follow-ahead and present results from real-world experiments in multiple buildings on campus. Our project page, including supplementary material and videos, can be found at: https://qingyuan-jiang.github.io/iros2024_poseForecasting/

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2312.09252 [pdf, other]

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection

Authors: Hongsuk Choi, Isaac Kasahara, Selim Engin, Moritz Graule, Nikhil Chavan-Dafle, Volkan Isler

Abstract: Recently introduced ControlNet has the ability to steer the text-driven image generation process with geometric input such as human 2D pose, or edge features. While ControlNet provides control over the geometric form of the instances in the generated image, it lacks the capability to dictate the visual appearance of each instance. We present FineControlNet to provide fine control over each instance's appearance while maintaining the precise pose control capability. Specifically, we develop and demonstrate FineControlNet with geometric control via human pose images and appearance control via instance-level text prompts. The spatial alignment of instance-specific text prompts and 2D poses in latent space enables the fine control capabilities of FineControlNet. We evaluate the performance of FineControlNet with rigorous comparison against state-of-the-art pose-conditioned text-to-image diffusion models. FineControlNet achieves superior performance in generating images that follow the user-provided instance-specific text prompts and poses compared with existing methods. Project webpage: https://samsunglabs.github.io/FineControlNet-project-page

Submitted 14 December, 2023; originally announced December 2023.

Comments: Hongsuk Choi and Isaac Kasahara have equal contributions. 19 pages, 15 figures, 3 tables

arXiv:2311.04783 [pdf, other]

VioLA: Aligning Videos to 2D LiDAR Scans

Authors: Jun-Jee Chao, Selim Engin, Nikhil Chavan-Dafle, Bhoram Lee, Volkan Isler

Abstract: We study the problem of aligning a video that captures a local portion of an environment to the 2D LiDAR scan of the entire environment. We introduce a method (VioLA) that starts with building a semantic map of the local scene from the image sequence, then extracts points at a fixed height for registering to the LiDAR map. Due to reconstruction errors or partial coverage of the camera scan, the reconstructed semantic map may not contain sufficient information for registration. To address this problem, VioLA makes use of a pre-trained text-to-image inpainting model paired with a depth completion model for filling in the missing scene content in a geometrically consistent fashion to support pose registration. We evaluate VioLA on two real-world RGB-D benchmarks, as well as a self-captured dataset of a large office scene. Notably, our proposed scene completion module improves the pose registration performance by up to 20%.

Submitted 8 November, 2023; originally announced November 2023.

Comments: 8 pages

arXiv:2310.20034 [pdf, other]

GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning

Authors: Moritz A. Graule, Volkan Isler

Abstract: A robot in a human-centric environment needs to account for the human's intent and future motion in its task and motion planning to ensure safe and effective operation. This requires symbolic reasoning about probable future actions and the ability to tie these actions to specific locations in the physical environment. While one can train behavioral models capable of predicting human motion from past activities, this approach requires large amounts of data to achieve acceptable long-horizon predictions. More importantly, the resulting models are constrained to specific data formats and modalities. Moreover, connecting predictions from such models to the environment at hand to ensure the applicability of these predictions is an unsolved problem. We present a system that utilizes a Large Language Model (LLM) to infer a human's next actions from a range of modalities without fine-tuning. A novel aspect of our system that is critical to robotics applications is that it links the predicted actions to specific locations in a semantic map of the environment. Our method leverages the fact that LLMs, trained on a vast corpus of text describing typical human behaviors, encode substantial world knowledge, including probable sequences of human actions and activities. We demonstrate how these localized activity predictions can be incorporated in a human-aware task planner for an assistive robot to reduce the occurrences of undesirable human-robot interactions by 29.2% on average.

Submitted 30 October, 2023; originally announced October 2023.

arXiv:2310.18473 [pdf, other]

Pouring by Feel: An Analysis of Tactile and Proprioceptive Sensing for Accurate Pouring

Authors: Pedro Piacenza, Daewon Lee, Volkan Isler

Abstract: As service robots begin to be deployed to assist humans, it is important for them to be able to perform a skill as ubiquitous as pouring. Specifically, we focus on the task of pouring an exact amount of water without any environmental instrumentation, that is, using only the robot's own sensors to perform this task in a general way robustly. In our approach we use a simple PID controller which uses the measured change in weight of the held container to supervise the pour. Unlike previous methods which use specialized force-torque sensors at the robot wrist, we use our robot joint torque sensors and investigate the added benefit of tactile sensors at the fingertips. We train three estimators from data which regress the poured weight out of the source container and show that we can accurately pour within 10 ml of the target on average while being robust enough to pour at novel locations and with different grasps on the source container.

Submitted 27 October, 2023; originally announced October 2023.
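
Illustrative note on the entry above: the abstract describes a PID controller that supervises the pour using the measured change in container weight. The sketch below is not the authors' implementation; it is a toy simulation under stated assumptions. The poured volume is read directly from the simulation (the paper instead estimates it from joint torque and tactile sensing), the flow rate is assumed to follow the controller command up to a hypothetical `max_flow`, and all names and gains (`PID`, `simulate_pour`, `kp`, `kd`) are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative gains, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, error: float, dt: float) -> float:
        # Accumulate the integral term and take a finite-difference derivative.
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_pour(target_ml: float, dt: float = 0.05, steps: int = 400,
                  max_flow: float = 30.0) -> float:
    """Toy plant: the poured volume integrates a flow rate commanded by the PID."""
    # The integral gain is left at zero because this toy plant is itself an
    # integrator, so proportional action already removes steady-state error.
    pid = PID(kp=2.0, ki=0.0, kd=0.05)
    poured = 0.0
    for _ in range(steps):
        error = target_ml - poured          # remaining volume to pour
        command = pid.step(error, dt)
        # Liquid cannot be un-poured, and the flow rate is physically limited.
        flow = min(max(command, 0.0), max_flow)
        poured += flow * dt
    return poured


if __name__ == "__main__":
    print(f"poured {simulate_pour(100.0):.2f} ml for a 100 ml target")
```

Clamping the flow to be non-negative reflects the one-way nature of pouring, which is why overshoot is costlier here than in a typical tracking problem.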

arXiv:2310.18459 [pdf, other]

VFAS-Grasp: Closed Loop Grasping with Visual Feedback and Adaptive Sampling

Authors: Pedro Piacenza, Jiacheng Yuan, Jinwook Huh, Volkan Isler

Abstract: We consider the problem of closed-loop robotic grasping and present a novel planner which uses Visual Feedback and an uncertainty-aware Adaptive Sampling strategy (VFAS) to close the loop. At each iteration, our method VFAS-Grasp builds a set of candidate grasps by generating random perturbations of a seed grasp. The candidates are then scored using a novel metric which combines a learned grasp-quality estimator, the uncertainty in the estimate and the distance from the seed proposal to promote temporal consistency. Additionally, we present two mechanisms to improve the efficiency of our sampling strategy: We dynamically scale the sampling region size and number of samples in it based on past grasp scores. We also leverage a motion vector field estimator to shift the center of our sampling region. We demonstrate that our algorithm can run in real time (20 Hz) and is capable of improving grasp performance for static scenes by refining the initial grasp proposal. We also show that it can enable grasping of slow moving objects, such as those encountered during human to robot handover.

Submitted 27 October, 2023; originally announced October 2023.

arXiv:2310.09463 [pdf, other]

HIO-SDF: Hierarchical Incremental Online Signed Distance Fields

Authors: Vasileios Vasilopoulos, Suveer Garg, Jinwook Huh, Bhoram Lee, Volkan Isler

Abstract: A good representation of a large, complex mobile robot workspace must be space-efficient yet capable of encoding relevant geometric details. When exploring unknown environments, it needs to be updatable incrementally in an online fashion. We introduce HIO-SDF, a new method that represents the environment as a Signed Distance Field (SDF). State of the art representations of SDFs are based on either neural networks or voxel grids. Neural networks are capable of representing the SDF continuously. However, they are hard to update incrementally as neural networks tend to forget previously observed parts of the environment unless an extensive sensor history is stored for training. Voxel-based representations do not have this problem but they are not space-efficient especially in large environments with fine details. HIO-SDF combines the advantages of these representations using a hierarchical approach which employs a coarse voxel grid that captures the observed parts of the environment together with high-resolution local information to train a neural network. HIO-SDF achieves a 46% lower mean global SDF error across all test scenes than a state of the art continuous representation, and a 30% lower error than a discrete representation at the same resolution as our coarse global SDF grid. Videos and code are available at: https://samsunglabs.github.io/HIO-SDF-project-page/

Submitted 3 March, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: IEEE International Conference on Robotics and Automation (ICRA 2024) - 7 pages, 7 figures

arXiv:2309.07891 [pdf, other]

HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

Authors: Hongsuk Choi, Nikhil Chavan-Dafle, Jiacheng Yuan, Volkan Isler, Hyunsoo Park

Abstract: This paper presents a method to learn hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. The inference as well as training-data generation for 3D hand-object scene reconstruction is challenging due to the depth ambiguity of a single image and occlusions by the hand and object. We turn this challenge into an opportunity by utilizing the hand shape to constrain the possible relative configuration of the hand and object geometry. We design a generalizable implicit function, HandNeRF, that explicitly encodes the correlation of the 3D hand shape features and 2D object features to predict the hand and object scene geometry. With experiments on real-world datasets, we show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods. Moreover, we demonstrate that object reconstruction from HandNeRF ensures more accurate execution of downstream tasks, such as grasping and motion planning for robotic hand-over and manipulation. The code is released here: https://github.com/SamsungLabs/HandNeRF

Submitted 11 February, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: In ICRA 2024; 13 pages including the supplementary material, 8 tables, 12 figures

arXiv:2308.03898 [pdf, other]

System Identification and Control of Front-Steered Ackermann Vehicles through Differentiable Physics

Authors: Burak M. Gonultas, Pratik Mukherjee, O. Goktug Poyrazoglu, Volkan Isler

Abstract: In this paper, we address the problem of system identification and control of a front-steered vehicle which abides by the Ackermann geometry constraints. This problem arises naturally for on-road and off-road vehicles that require reliable system identification and basic feedback controllers for various applications such as lane keeping and way-point navigation. Traditional system identification requires expensive equipment and is time consuming. In this work we explore the use of differentiable physics for system identification and controller design and make the following contributions: i) We develop a differentiable physics simulator (DPS) to provide a method for the system identification of front-steered class of vehicles whose system parameters are learned using a gradient-based method; ii) We provide results for our gradient-based method that exhibit better sample efficiency in comparison to other gradient-free methods; iii) We validate the learned system parameters by implementing a feedback controller to demonstrate stable lane keeping performance on a real front-steered vehicle, the F1TENTH; iv) Further, we provide results exhibiting comparable lane keeping behavior for system parameters learned using our gradient-based method with lane keeping behavior of the actual system parameters of the F1TENTH.

Submitted 8 November, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted for IROS 2023

arXiv:2308.00134 [pdf, other]

Onboard View Planning of a Flying Camera for High Fidelity 3D Reconstruction of a Moving Actor

Authors: Qingyuan Jiang, Volkan Isler

Abstract: Capturing and reconstructing a human actor's motion is important for filmmaking and gaming. Currently, motion capture systems with static cameras are used for pixel-level high-fidelity reconstructions. Such setups are costly, require installation and calibration and, more importantly, confine the user to a predetermined area. In this work, we present a drone-based motion capture system that can alleviate these limitations. We present a complete system implementation and study view planning which is critical for achieving high-quality reconstructions. The main challenge for view planning for a drone-based capture system is that it needs to be performed during motion capture. To address this challenge, we introduce simple geometric primitives and show that they can be used for view planning. Specifically, we introduce Pixel-Per-Area (PPA) as a reconstruction quality proxy and plan views by maximizing the PPA of the faces of a simple geometric shape representing the actor. Through experiments in simulation, we show that PPA is highly correlated with reconstruction quality. We also conduct real-world experiments showing that our system can produce dynamic 3D reconstructions of good quality. We share our code for the simulation experiments in the link: https://github.com/Qingyuan-Jiang/view_planning_3dhuman

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.11932 [pdf, other]

RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction

Authors: Isaac Kasahara, Shubham Agrawal, Selim Engin, Nikhil Chavan-Dafle, Shuran Song, Volkan Isler

Abstract: General scene reconstruction refers to the task of estimating the full 3D geometry and texture of a scene containing previously unseen objects. In many practical applications such as AR/VR, autonomous navigation, and robotics, only a single view of the scene may be available, making the scene reconstruction task challenging. In this paper, we present a method for scene reconstruction by structurally breaking the problem into two steps: rendering novel views via inpainting and 2D to 3D scene lifting. Specifically, we leverage the generalization capability of large visual language models (Dalle-2) to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting normals of the inpainted image and solving for the missing depth values. By predicting for normals instead of depth directly, our method allows for robustness to changes in depth distributions and scale. With rigorous quantitative evaluation, we show that our method outperforms multiple baselines while providing generalization to novel objects and scenes.

Submitted 4 October, 2023; v1 submitted 21 July, 2023; originally announced July 2023.

arXiv:2305.10534 [pdf, other]

RAMP: Hierarchical Reactive Motion Planning for Manipulation Tasks Using Implicit Signed Distance Functions

Authors: Vasileios Vasilopoulos, Suveer Garg, Pedro Piacenza, Jinwook Huh, Volkan Isler

Abstract: We introduce Reactive Action and Motion Planner (RAMP), which combines the strengths of sampling-based and reactive approaches for motion planning. In essence, RAMP is a hierarchical approach where a novel variant of a Model Predictive Path Integral (MPPI) controller is used to generate trajectories which are then followed asynchronously by a local vector field controller. We demonstrate, in the context of a table clearing application, that RAMP can rapidly find paths in the robot's configuration space, satisfy task and robot-specific constraints, and provide safety by reacting to static or dynamically moving obstacles. RAMP achieves superior performance through a number of key innovations: we use Signed Distance Function (SDF) representations directly from the robot configuration space, both for collision checking and reactive control. The use of SDFs allows for a smoother definition of collision cost when planning for a trajectory, and is critical in ensuring safety while following trajectories. In addition, we introduce a novel variant of MPPI which, combined with the safety guarantees of the vector field trajectory follower, performs incremental real-time global trajectory planning. Simulation results establish that our method can generate paths that are comparable to traditional and state-of-the-art approaches in terms of total trajectory length while being up to 30 times faster. Real-world experiments demonstrate the safety and effectiveness of our approach in challenging table clearing scenarios. Videos and code are available at: https://samsunglabs.github.io/RAMP-project-page/

Submitted 31 July, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023) - 8 pages, 6 figures

arXiv:2305.09510 [pdf, other]

Real-time Simultaneous Multi-Object 3D Shape Reconstruction, 6DoF Pose Estimation and Dense Grasp Prediction

Authors: Shubham Agrawal, Nikhil Chavan-Dafle, Isaac Kasahara, Selim Engin, Jinwook Huh, Volkan Isler

Abstract: Robotic manipulation systems operating in complex environments rely on perception systems that provide information about the geometry (pose and 3D shape) of the objects in the scene along with other semantic information such as object labels. This information is then used for choosing the feasible grasps on relevant objects. In this paper, we present a novel method to provide this geometric and semantic information of all objects in the scene as well as feasible grasps on those objects simultaneously. The main advantage of our method is its speed as it avoids sequential perception and grasp planning steps. With detailed quantitative analysis, we show that our method delivers competitive performance compared to the state-of-the-art dedicated methods for object shape, pose, and grasp predictions while providing fast inference at 30 frames per second speed.

Submitted 16 May, 2023; originally announced May 2023.

ACM Class: I.4.5; I.4.8; I.4.10; I.2.9; I.2.10; I.6.3

arXiv:2304.07200 [pdf, other]

doi 10.1109/LRA.2022.3188400

EV-Catcher: High-Speed Object Catching Using Low-latency Event-based Neural Networks

Authors: Ziyun Wang, Fernando Cladera Ojeda, Anthony Bisulco, Daewon Lee, Camillo J. Taylor, Kostas Daniilidis, M. Ani Hsieh, Daniel D. Lee, Volkan Isler

Abstract: Event-based sensors have recently drawn increasing interest in robotic perception due to their lower latency, higher dynamic range, and lower bandwidth requirements compared to standard CMOS-based imagers. These properties make them ideal tools for real-time perception tasks in highly dynamic environments. In this work, we demonstrate an application where event cameras excel: accurately estimating the impact location of fast-moving objects. We introduce a lightweight event representation called Binary Event History Image (BEHI) to encode event data at low latency, as well as a learning-based approach that allows real-time inference of a confidence-enabled control signal to the robot. To validate our approach, we present an experimental catching system in which we catch fast-flying ping-pong balls. We show that the system is capable of achieving a success rate of 81% in catching balls targeted at different locations, with a velocity of up to 13 m/s even on compute-constrained embedded platforms such as the Nvidia Jetson NX.

Submitted 14 April, 2023; originally announced April 2023.

Comments: 8 pages, 6 figures, IEEE Robotics and Automation Letters (Volume: 7, Issue: 4, October 2022)

arXiv:2304.04100 [pdf, other]

Pick2Place: Task-aware 6DoF Grasp Estimation via Object-Centric Perspective Affordance

Authors: Zhanpeng He, Nikhil Chavan-Dafle, Jinwook Huh, Shuran Song, Volkan Isler

Abstract: The choice of a grasp plays a critical role in the success of downstream manipulation tasks. Consider a task of placing an object in a cluttered scene; the majority of possible grasps may not be suitable for the desired placement. In this paper, we study the synergy between the picking and placing of an object in a cluttered scene to develop an algorithm for task-aware grasp estimation. We present an object-centric action space that encodes the relationship between the geometry of the placement scene and the object to be placed in order to provide placement affordance maps directly from perspective views of the placement scene. This action space enables the computation of a one-to-one mapping between the placement and picking actions allowing the robot to generate a diverse set of pick-and-place proposals and to optimize for a grasp under other task constraints such as robot kinematics and collision avoidance. With experiments both in simulation and on a real robot we demonstrate that with our method, the robot is able to successfully complete the task of placement-aware grasping with over 89% accuracy in such a way that generalizes to novel objects and scenes.

Submitted 8 April, 2023; originally announced April 2023.

Comments: IEEE International Conference on Robotics and Automation 2023

arXiv:2303.01010 [pdf, other]

Active Mass Distribution Estimation from Tactile Feedback

Authors: Jiacheng Yuan, Changhyun Choi, Ellad B. Tadmor, Volkan Isler

Abstract: In this work, we present a method to estimate the mass distribution of a rigid object through robotic interactions and tactile feedback. This is a challenging problem because of the complexity of physical dynamics modeling and the action dependencies across the model parameters. We propose a sequential estimation strategy combined with a set of robot action selection rules based on the analytical formulation of a discrete-time dynamics model. To evaluate the performance of our approach, we also manufactured re-configurable block objects that allow us to modify the object mass distribution while having access to the ground truth values. We compare our approach against multiple baselines and show that our approach can estimate the mass distribution with around 10% error, while the baselines have errors ranging from 18% to 68%.

Submitted 2 March, 2023; originally announced March 2023.

arXiv:2302.12883 [pdf, other]

3D Surface Reconstruction in the Wild by Deforming Shape Priors from Synthetic Data

Authors: Nicolai Häni, Jun-Jee Chao, Volkan Isler

Abstract: Reconstructing the underlying 3D surface of an object from a single image is a challenging problem that has received extensive attention from the computer vision community. Many learning-based approaches tackle this problem by learning a 3D shape prior from either ground truth 3D data or multi-view observations. To achieve state-of-the-art results, these methods assume that the objects are specified with respect to a fixed canonical coordinate frame, where instances of the same category are perfectly aligned. In this work, we present a new method for joint category-specific 3D reconstruction and object pose estimation from a single image. We show that one can leverage shape priors learned on purely synthetic 3D data together with a point cloud pose canonicalization method to achieve high-quality 3D reconstruction in the wild. Given a single depth image at test time, we first transform this partial point cloud into a learned canonical frame. Then, we use a neural deformation field to reconstruct the 3D surface of the object. Finally, we jointly optimize object pose and 3D shape to fit the partial depth observation. Our approach achieves state-of-the-art reconstruction performance across several real-world datasets, even when trained only on synthetic data. We further show that our method generalizes to different input modalities, from dense depth images to sparse and noisy LIDAR scans.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.09846 [pdf, other]

Neural Optimal Control using Learned System Dynamics

Authors: Selim Engin, Volkan Isler

Abstract: We study the problem of generating control laws for systems with unknown dynamics. Our approach is to represent the controller and the value function with neural networks, and to train them using loss functions adapted from the Hamilton-Jacobi-Bellman (HJB) equations. In the absence of a known dynamics model, our method first learns the state transitions from data collected by interacting with the system in an offline process. The learned transition function is then integrated to the HJB equations and used to forward simulate the control signals produced by our controller in a feedback loop. In contrast to trajectory optimization methods that optimize the controller for a single initial state, our controller can generate near-optimal control signals for initial states from a large portion of the state space. Compared to recent model-based reinforcement learning algorithms, we show that our method is more sample efficient and trains faster by an order of magnitude. We demonstrate our method in a number of tasks, including the control of a quadrotor with 12 state variables.

Submitted 20 February, 2023; originally announced February 2023.

arXiv:2212.06393 [pdf]

Predicting Energy Consumption of Ground Robots On Uneven Terrains

Authors: Minghan Wei, Volkan Isler

Abstract: Optimizing energy consumption for robot navigation in fields requires energy-cost maps. However, obtaining such a map is still challenging, especially for large, uneven terrains. Physics-based energy models work for uniform, flat surfaces but do not generalize well to these terrains. Furthermore, slopes make the energy consumption at every location directional and add to the complexity of data collection and energy prediction. In this paper, we address these challenges in a data-driven manner. We consider a function which takes terrain geometry and robot motion direction as input and outputs expected energy consumption. The function is represented as a ResNet-based neural network whose parameters are learned from field-collected data. The prediction accuracy of our method is within 12% of the ground truth in our test environments that are unseen during training. We compare our method to a baseline method in the literature: a method using a basic physics-based model. We demonstrate that our method significantly outperforms it by more than 10% measured by the prediction error. More importantly, our method generalizes better when applied to test data from new environments with various slope angles and navigation directions.

Submitted 13 December, 2022; originally announced December 2022.

Journal ref: IEEE Robotics and Automation Letters, 2021

arXiv:2209.14419 [pdf, other]

Category-Level Global Camera Pose Estimation with Multi-Hypothesis Point Cloud Correspondences

Authors: Jun-Jee Chao, Selim Engin, Nicolai Häni, Volkan Isler

Abstract: Correspondence search is an essential step in rigid point cloud registration algorithms. Most methods maintain a single correspondence at each step and gradually remove wrong correspondences. However, building one-to-one correspondence with hard assignments is extremely difficult, especially when matching two point clouds with many locally similar features. This paper proposes an optimization method that retains all possible correspondences for each keypoint when matching a partial point cloud to a complete point cloud. These uncertain correspondences are then gradually updated with the estimated rigid transformation by considering the matching cost. Moreover, we propose a new point feature descriptor that measures the similarity between local point cloud regions. Extensive experiments show that our method outperforms the state-of-the-art (SoTA) methods even when matching different objects within the same category. Notably, our method outperforms the SoTA methods when registering real-world noisy depth images to a template shape by up to 20% performance.

Submitted 28 September, 2022; originally announced September 2022.

Comments: 8 pages

arXiv:2209.05432 [pdf, other]

Self-supervised Wide Baseline Visual Servoing via 3D Equivariance

Authors: Jinwook Huh, Jungseok Hong, Suveer Garg, Hyun Soo Park, Volkan Isler

Abstract: One of the challenging input settings for visual servoing is when the initial and goal camera views are far apart. Such settings are difficult because the wide baseline can cause drastic changes in object appearance and cause occlusions. This paper presents a novel self-supervised visual servoing method for wide baseline images which does not require 3D ground truth supervision. Existing approaches that regress absolute camera pose with respect to an object require 3D ground truth data of the object in the forms of 3D bounding boxes or meshes. We learn a coherent visual representation by leveraging a geometric property called 3D equivariance: the representation is transformed in a predictable way as a function of 3D transformation. To ensure that the feature-space is faithful to the underlying geodesic space, a geodesic preserving constraint is applied in conjunction with the equivariance. We design a Siamese network that can effectively enforce these two geometric properties without requiring 3D supervision. With the learned model, the relative transformation can be inferred simply by following the gradient in the learned space and used as feedback for closed-loop visual servoing. Our method is evaluated on objects from the YCB dataset, showing meaningful outperformance on a visual servoing task, or object alignment task with respect to state-of-the-art approaches that use 3D supervision. Ours yields more than 35% average distance error reduction and more than 90% success rate with 3cm error tolerance.

Submitted 12 September, 2022; originally announced September 2022.

Comments: Accepted at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2208.11566 [pdf, other]

doi 10.1109/IROS.2018.8594304

Apple Counting using Convolutional Neural Networks

Authors: Nicolai Häni, Pravakar Roy, Volkan Isler

Abstract: Estimating accurate and reliable fruit and vegetable counts from images in real-world settings, such as orchards, is a challenging problem that has received significant recent attention. Estimating fruit counts before harvest provides useful information for logistics planning. While considerable progress has been made toward fruit detection, estimating the actual counts remains challenging. In practice, fruits are often clustered together. Therefore, methods that only detect fruits fail to offer general solutions to estimate accurate fruit counts. Furthermore, in horticultural studies, rather than a single yield estimate, finer information such as the distribution of the number of apples per cluster is desirable. In this work, we formulate fruit counting from images as a multi-class classification problem and solve it by training a Convolutional Neural Network. We first evaluate the per-image accuracy of our method and compare it with a state-of-the-art method based on Gaussian Mixture Models over four test datasets. Even though the parameters of the Gaussian Mixture Model-based method are specifically tuned for each dataset, our network outperforms it in three out of four datasets with a maximum of 94% accuracy. Next, we use the method to estimate the yield for two datasets for which we have ground truth. Our method achieved 96-97% accuracies. For additional details please see our video here: https://www.youtube.com/watch?v=Le0mb5P-SYc

Submitted 24 August, 2022; originally announced August 2022.

Journal ref: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:2208.11538 [pdf, other]

doi 10.1109/IROS.2016.7759456

Visual Servoing in Orchard Settings

Authors: Nicolai Häni, Volkan Isler

Abstract: We present a general framework for accurate positioning of sensors and end effectors in farm settings using a camera mounted on a robotic manipulator. Our main contribution is a visual servoing approach based on a new and robust feature tracking algorithm. Results from field experiments performed at an apple orchard demonstrate that our approach converges to a given termination criterion even under environmental influences such as strong winds, varying illumination conditions and partial occlusion of the target object. Further, we show experimentally that the system converges to the desired view for a wide range of initial conditions. This approach opens possibilities for new applications such as automated fruit inspection, fruit picking or precise pesticide application.

Submitted 24 August, 2022; originally announced August 2022.

Journal ref: In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 2946-2953)

arXiv:2112.00216 [pdf, other]

PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound

Authors: Zhijian Yang, Xiaoran Fan, Volkan Isler, Hyun Soo Park

Abstract: Reconstructing the 3D pose of a person in metric scale from a single view image is a geometrically ill-posed problem. For example, we can not measure the exact distance of a person to the camera from a single view image without additional scene assumptions (e.g., known height). Existing learning based approaches circumvent this issue by reconstructing the 3D pose up to scale. However, there are many applications such as virtual telepresence, robotics, and augmented reality that require metric scale reconstruction. In this paper, we show that audio signals recorded along with an image, provide complementary information to reconstruct the metric 3D pose of the person. The key insight is that as the audio signals traverse across the 3D space, their interactions with the body provide metric information about the body's pose. Based on this insight, we introduce a time-invariant transfer function called pose kernel -- the impulse response of audio signals induced by the body pose. The main properties of the pose kernel are that (1) its envelope highly correlates with 3D pose, (2) the time response corresponds to arrival time, indicating the metric distance to the microphone, and (3) it is invariant to changes in the scene geometry configurations. Therefore, it is readily generalizable to unseen scenes. We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct 3D pose in a metric scale. We show that our multi-modal method produces accurate metric reconstruction in real world scenes, which is not possible with state-of-the-art lifting approaches including parametric mesh regression and depth regression.

Submitted 2 December, 2021; v1 submitted 30 November, 2021; originally announced December 2021.

arXiv:2111.10462 [pdf, other]

Online Coverage Planning for an Autonomous Weed Mowing Robot with Curvature Constraints

Authors: Parikshit Maini, Burak M. Gonultas, Volkan Isler

Abstract: The land used for grazing cattle takes up about one-third of the land in the United States. These areas can be highly rugged. Yet, they need to be maintained to prevent weeds from taking over the nutritious grassland. This can be a daunting task especially in the case of organic farming since herbicides cannot be used. In this paper, we present the design of Cowbot, an autonomous weed mowing robot for pastures. Cowbot is an electric mower designed to operate in the rugged environments on cow pastures and provide a cost-effective method for weed control in organic farms. Path planning for the Cowbot is challenging since weed distribution on pastures is unknown. Given a limited field of view, online path planning is necessary to detect weeds and plan paths to mow them. We study the general online path planning problem for an autonomous mower with curvature and field of view constraints. We develop two online path planning algorithms that are able to utilize new information about weeds to optimize path length and ensure coverage. We deploy our algorithms on the Cowbot and perform field experiments to validate the suitability of our methods for real-time path planning. We also perform extensive simulation experiments which show that our algorithms result in up to 60% reduction in path length as compared to baseline boustrophedon and random-search based coverage paths.

Submitted 19 November, 2021; originally announced November 2021.

arXiv:2109.07134 [pdf, other]

ROW-SLAM: Under-Canopy Cornfield Semantic SLAM

Authors: Jiacheng Yuan, Jungseok Hong, Junaed Sattar, Volkan Isler

Abstract: We study a semantic SLAM problem faced by a robot tasked with autonomous weeding under the corn canopy. The goal is to detect corn stalks and localize them in a global coordinate frame. This is a challenging setup for existing algorithms because there is very little space between the camera and the plants, and the camera motion is primarily restricted to be along the row. To overcome these challenges, we present a multi-camera system where a side camera (facing the plants) is used for detection whereas front and back cameras are used for motion estimation. Next, we show how semantic features in the environment (corn stalks, ground, and crop planes) can be used to develop a robust semantic SLAM solution and present results from field trials performed throughout the growing season across various cornfields.

Submitted 15 September, 2021; originally announced September 2021.

Comments: 7 pages, 6 figures

arXiv:2109.06837 [pdf, other]

Simultaneous Object Reconstruction and Grasp Prediction using a Camera-centric Object Shell Representation

Authors: Nikhil Chavan-Dafle, Sergiy Popovych, Shubham Agrawal, Daniel D. Lee, Volkan Isler

Abstract: Being able to grasp objects is a fundamental component of most robotic manipulation systems. In this paper, we present a new approach to simultaneously reconstruct a mesh and a dense grasp quality map of an object from a depth image. At the core of our approach is a novel camera-centric object representation called the "object shell" which is composed of an observed "entry image" and a predicted "exit image". We present an image-to-image residual ConvNet architecture in which the object shell and a grasp-quality map are predicted as separate output channels. The main advantage of the shell representation and the corresponding neural network architecture, ShellGrasp-Net, is that the input-output pixel correspondences in the shell representation are explicitly represented in the architecture. We show that this coupling yields superior generalization capabilities for object reconstruction and accurate grasp quality estimation implicitly considering the object geometry. Our approach yields an efficient dense grasp quality map and an object geometry estimate in a single forward pass. Both of these outputs can be used in a wide range of robotic manipulation applications. With rigorous experimental validation, both in simulation and on a real setup, we show that our shell-based method can be used to generate precise grasps and the associated grasp quality with over 90% accuracy. Diverse grasps computed on shell reconstructions allow the robot to select and execute grasps in cluttered scenes with more than 93% success rate.

Submitted 19 December, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: 18 pages, 12 figures, 8 tables

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)

arXiv:2103.11168 [pdf, other]

Learning Continuous Cost-to-Go Functions for Non-holonomic Systems

Authors: Jinwook Huh, Daniel D. Lee, Volkan Isler

Abstract: This paper presents a supervised learning method to generate continuous cost-to-go functions of non-holonomic systems directly from the workspace description. Supervision from informative examples reduces training time and improves network performance. The manifold representing the optimal trajectories of a non-holonomic system has high-curvature regions which can not be efficiently captured with uniform sampling. To address this challenge, we present an adaptive sampling method which makes use of sampling-based planners along with local, closed-form solutions to generate training samples. The cost-to-go function over a specific workspace is represented as a neural network whose weights are generated by a second, higher order network. The networks are trained in an end-to-end fashion. In our previous work, this architecture was shown to successfully learn to generate the cost-to-go functions of holonomic systems using uniform sampling. In this work, we show that uniform sampling fails for non-holonomic systems. However, with the proposed adaptive sampling methodology, our network can generate near-optimal trajectories for non-holonomic systems while avoiding obstacles. Experiments show that our method is two orders of magnitude faster compared to traditional approaches in cluttered environments.

Submitted 20 March, 2021; originally announced March 2021.

arXiv:2101.05212 [pdf, other]

Ellipse Regression with Predicted Uncertainties for Accurate Multi-View 3D Object Estimation

Authors: Wenbo Dong, Volkan Isler

Abstract: Convolutional neural network (CNN) based architectures, such as Mask R-CNN, constitute the state of the art in object detection and segmentation. Recently, these methods have been extended for model-based segmentation where the network outputs the parameters of a geometric model (e.g. an ellipse) directly. This work considers objects whose three-dimensional models can be represented as ellipsoids. We present a variant of Mask R-CNN for estimating the parameters of ellipsoidal objects by segmenting each object and accurately regressing the parameters of projection ellipses. We show that model regression is sensitive to the underlying occlusion scenario and that prediction quality for each object needs to be characterized individually for accurate 3D object estimation. We present a novel ellipse regression loss which can learn the offset parameters with their uncertainties and quantify the overall geometric quality of detection for each ellipse. These values, in turn, allow us to fuse multi-view detections to obtain 3D ellipsoid parameters in a principled fashion. The experiments on both synthetic and real datasets quantitatively demonstrate the high accuracy of our proposed method in estimating 3D objects under heavy occlusions compared to previous state-of-the-art methods.

Submitted 27 December, 2020; originally announced January 2021.

Comments: 9 pages, 9 figures

arXiv:2012.06023 [pdf, other]

Cost-to-Go Function Generating Networks for High Dimensional Motion Planning

Authors: Jinwook Huh, Volkan Isler, Daniel D. Lee

Abstract: This paper presents c2g-HOF networks which learn to generate cost-to-go functions for manipulator motion planning. The c2g-HOF architecture consists of a cost-to-go function over the configuration space represented as a neural network (c2g-network) as well as a Higher Order Function (HOF) network which outputs the weights of the c2g-network for a given input workspace. Both networks are trained end-to-end in a supervised fashion using costs computed from traditional motion planners. Once trained, c2g-HOF can generate a smooth and continuous cost-to-go function directly from workspace sensor inputs (represented as a point cloud in 3D or an image in 2D). At inference time, the weights of the c2g-network are computed very efficiently and near-optimal trajectories are generated by simply following the gradient of the cost-to-go function. We compare c2g-HOF with traditional planning algorithms for various robots and planning scenarios. The experimental results indicate that planning with c2g-HOF is significantly faster than other motion planning algorithms, resulting in orders of magnitude improvement when including collision checking. Furthermore, despite being trained from sparsely sampled trajectories in configuration space, c2g-HOF generalizes to generate smoother, and often lower cost, trajectories. We demonstrate cost-to-go based planning on a 7 DoF manipulator arm where motion planning in a complex workspace requires only 0.13 seconds for the entire trajectory.

Submitted 10 December, 2020; originally announced December 2020.

arXiv:2011.09427 [pdf, other]

Fast Motion Understanding with Spatiotemporal Neural Networks and Dynamic Vision Sensors

Authors: Anthony Bisulco, Fernando Cladera Ojeda, Volkan Isler, Daniel D. Lee

Abstract: This paper presents a Dynamic Vision Sensor (DVS) based system for reasoning about high speed motion. As a representative scenario, we consider the case of a robot at rest reacting to a small, fast approaching object at speeds higher than 15m/s. Since conventional image sensors at typical frame rates observe such an object for only a few frames, estimating the underlying motion presents a considerable challenge for standard computer vision systems and algorithms. In this paper we present a method motivated by how animals such as insects solve this problem with their relatively simple vision systems. Our solution takes the event stream from a DVS and first encodes the temporal events with a set of causal exponential filters across multiple time scales. We couple these filters with a Convolutional Neural Network (CNN) to efficiently extract relevant spatiotemporal features. The combined network learns to output both the expected time to collision of the object, as well as the predicted collision point on a discretized polar grid. These critical estimates are computed with minimal delay by the network in order to react appropriately to the incoming object. We highlight the results of our system to a toy dart moving at 23.4m/s with a 24.73° error in $\theta$, 18.4mm average discretized radius prediction error, and 25.03% median time to collision prediction error.

Submitted 18 November, 2020; originally announced November 2020.

Journal ref: International Conference on Robotics and Automation (ICRA) 2021

arXiv:2011.08319 [pdf, other]

Multi-Step Recurrent Q-Learning for Robotic Velcro Peeling

Authors: Jiacheng Yuan, Nicolai Häni, Volkan Isler

Abstract: Learning object manipulation is a critical skill for robots to interact with their environment. Even though there has been significant progress in robotic manipulation of rigid objects, interacting with non-rigid objects remains challenging for robots. In this work, we introduce velcro peeling as a representative application for robotic manipulation of non-rigid objects in complex environments. We present a method of learning force-based manipulation from noisy and incomplete sensor inputs in partially observable environments by modeling long term dependencies between measurements with a multi-step deep recurrent network. We present experiments on a real robot to show the necessity of modeling these long term dependencies and validate our approach in simulation and robot experiments. Our results show that using tactile input enables the robot to overcome geometric uncertainties present in the environment with high fidelity in ~90% of all cases, outperforming the baselines by a large margin.

Submitted 22 February, 2022; v1 submitted 16 November, 2020; originally announced November 2020.

arXiv:2010.14597 [pdf, other]

Learning to Generate Cost-to-Go Functions for Efficient Motion Planning

Authors: Jinwook Huh, Galen Xing, Ziyun Wang, Volkan Isler, Daniel D. Lee

Abstract: Traditional motion planning is computationally burdensome for practical robots, involving extensive collision checking and considerable iterative propagation of cost values. We present a novel neural network architecture which can directly generate the cost-to-go (c2g) function for a given configuration space and a goal configuration. The output of the network is a continuous function whose gradient in configuration space can be directly used to generate trajectories in motion planning without the need for protracted iterations or extensive collision checking. This higher order function (i.e. a function generating another function) representation lies at the core of our motion planning architecture, c2g-HOF, which can take a workspace as input, and generate the cost-to-go function over the configuration space map (C-map). Simulation results for 2D and 3D environments show that c2g-HOF can be orders of magnitude faster at execution time than methods which explore the configuration space during execution. We also present an implementation of c2g-HOF which generates trajectories for robot manipulators directly from an overhead image of the workspace.

Submitted 27 October, 2020; originally announced October 2020.

arXiv:2007.15627 [pdf, other]

Continuous Object Representation Networks: Novel View Synthesis without Target View Supervision

Authors: Nicolai Häni, Selim Engin, Jun-Jee Chao, Volkan Isler

Abstract: Novel View Synthesis (NVS) is concerned with synthesizing views under camera viewpoint transformations from one or multiple input images. NVS requires explicit reasoning about 3D object structure and unseen parts of the scene to synthesize convincing results. As a result, current approaches typically rely on supervised training with either ground truth 3D models or multiple target images. We propose Continuous Object Representation Networks (CORN), a conditional architecture that encodes an input image's geometry and appearance that map to a 3D consistent scene representation. We can train CORN with only two source images per object by combining our model with a neural renderer. A key feature of CORN is that it requires no ground truth 3D models or target view supervision. Regardless, CORN performs well on challenging tasks such as novel view synthesis and single-view 3D reconstruction and achieves performance comparable to state-of-the-art approaches that use direct supervision. For up-to-date information, data, and code, please see our project page: https://nicolaihaeni.github.io/corn/.

Submitted 23 October, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

Comments: To appear at Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

arXiv:2006.07981 [pdf, other]

Geodesic-HOF: 3D Reconstruction Without Cutting Corners

Authors: Ziyun Wang, Eric A. Mitchell, Volkan Isler, Daniel D. Lee

Abstract: Single-view 3D object reconstruction is a challenging fundamental problem in computer vision, largely due to the morphological diversity of objects in the natural world. In particular, high curvature regions are not always captured effectively by methods trained using only set-based loss functions, resulting in reconstructions short-circuiting the surface or cutting corners. To address this issue, we propose learning an image-conditioned mapping function from a canonical sampling domain to a high dimensional space where the Euclidean distance is equal to the geodesic distance on the object. The first three dimensions of a mapped sample correspond to its 3D coordinates. The additional lifted components contain information about the underlying geodesic structure. Our results show that taking advantage of these learned lifted coordinates yields better performance for estimating surface normals and generating surfaces than using point cloud reconstructions alone. Further, we find that this learned geodesic embedding space provides useful information for applications such as unsupervised object decomposition.

Submitted 14 June, 2020; originally announced June 2020.

arXiv:2004.01689 [pdf, other]

Near-chip Dynamic Vision Filtering for Low-Bandwidth Pedestrian Detection

Authors: Anthony Bisulco, Fernando Cladera Ojeda, Volkan Isler, Daniel D. Lee

Abstract: This paper presents a novel end-to-end system for pedestrian detection using Dynamic Vision Sensors (DVSs). We target applications where multiple sensors transmit data to a local processing unit, which executes a detection algorithm. Our system is composed of (i) a near-chip event filter that compresses and denoises the event stream from the DVS, and (ii) a Binary Neural Network (BNN) detection module that runs on a low-computation edge computing device (in our case a STM32F4 microcontroller). We present the system architecture and provide an end-to-end implementation for pedestrian detection in an office environment. Our implementation reduces transmission size by up to 99.6% compared to transmitting the raw event stream. The average packet size in our system is only 1397 bits, while 307.2 kb are required to send an uncompressed DVS time window. Our detector is able to perform a detection every 450 ms, with an overall testing F1 score of 83%. The low bandwidth and energy properties of our system make it ideal for IoT applications.
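
As a quick sanity check of the bandwidth figures quoted in this abstract, the short sketch below computes the average reduction implied by the two packet sizes. It assumes that "307.2 kb" means kilobits with 1 kb = 1000 bits, which the abstract does not state explicitly; the helper name `transmission_reduction` is hypothetical.

```python
def transmission_reduction(compressed_bits, raw_bits):
    """Return the fraction of bits saved relative to sending the raw window."""
    if raw_bits <= 0:
        raise ValueError("raw_bits must be positive")
    return 1.0 - compressed_bits / raw_bits


# Assumption: "307.2 kb" is 307.2 kilobits, with 1 kb = 1000 bits.
RAW_WINDOW_BITS = 307.2 * 1000
# Average packet size quoted in the abstract.
AVERAGE_PACKET_BITS = 1397

average_reduction = transmission_reduction(AVERAGE_PACKET_BITS, RAW_WINDOW_BITS)
print(f"average reduction: {average_reduction:.2%}")
```

Under that unit assumption the average reduction comes out at roughly 99.5%, which is consistent with the abstract's "up to 99.6%" since the quoted maximum should be at least as large as the average.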

Submitted 3 April, 2020; originally announced April 2020.

Comments: 6 pages, 5 figures

arXiv:2003.01649 [pdf, other]

Robotic Grasping through Combined Image-Based Grasp Proposal and 3D Reconstruction

Authors: Daniel Yang, Tarik Tosun, Ben Eisner, Volkan Isler, Daniel Lee

Abstract: We present a novel approach to robotic grasp planning using both a learned grasp proposal network and a learned 3D shape reconstruction network. Our system generates 6-DOF grasps from a single RGB-D image of the target object, which is provided as input to both networks. By using the geometric reconstruction to refine the candidate grasp produced by the grasp proposal network, our system is able to accurately grasp both known and unknown objects, even when the grasp location on the object is not visible in the input image. This paper presents the network architectures, training procedures, and grasp refinement method that comprise our system. Experiments demonstrate the efficacy of our system at grasping both known and unknown objects (91% success rate in a physical robot environment, 84% success rate in a simulated environment). We additionally perform ablation studies that show the benefits of combining a learned grasp proposal with geometric reconstruction for grasping, and also show that our system outperforms several baselines in a grasping task.

Submitted 6 November, 2020; v1 submitted 3 March, 2020; originally announced March 2020.

Comments: 7 pages, 7 figures

arXiv:2002.09850 [pdf, other]

Active localization of multiple targets using noisy relative measurements

Authors: Selim Engin, Volkan Isler

Abstract: Consider a mobile robot tasked with localizing targets at unknown locations by obtaining relative measurements. The observations can be bearing or range measurements. How should the robot move so as to localize the targets and minimize the uncertainty in their locations as quickly as possible? Most existing approaches are either greedy in nature or rely on accurate initial estimates. We formulate this path planning problem as an unsupervised learning problem where the measurements are aggregated using a Bayesian histogram filter. The robot learns to minimize the total uncertainty of each target in the shortest amount of time using the current measurement and an aggregate representation of the current belief state. We analyze our method in a series of experiments where we show that our method outperforms a standard greedy approach. In addition, its performance is also comparable to an offline algorithm which has access to the true location of the targets.

Submitted 23 February, 2020; originally announced February 2020.

Comments: 8 pages, 5 figures

arXiv:2001.11584 [pdf, other]

doi 10.1109/TIP.2021.3050673

Ellipse R-CNN: Learning to Infer Elliptical Object from Clustering and Occlusion

Authors: Wenbo Dong, Pravakar Roy, Cheng Peng, Volkan Isler

Abstract: Images of heavily occluded objects in cluttered scenes, such as fruit clusters in trees, are hard to segment. To further retrieve the 3D size and 6D pose of each individual object in such cases, bounding boxes are not reliable from multiple views since only a small portion of the object's geometry is captured. We introduce the first CNN-based ellipse detector, called Ellipse R-CNN, to represent and infer occluded objects as ellipses. We first propose a robust and compact ellipse regression based on the Mask R-CNN architecture for elliptical object detection. Our method can infer the parameters of multiple elliptical objects even when they are occluded by other neighboring objects. For better occlusion handling, we exploit refined feature regions for the regression stage, and integrate the U-Net structure for learning different occlusion patterns to compute the final detection score. The correctness of ellipse regression is validated through experiments performed on synthetic data of clustered ellipses. We further quantitatively and qualitatively demonstrate that our approach outperforms the state-of-the-art model (i.e., Mask R-CNN followed by ellipse fitting) and its three variants on both synthetic and real datasets of occluded and clustered elliptical objects.

Submitted 14 November, 2020; v1 submitted 30 January, 2020; originally announced January 2020.

Comments: 18 pages, 20 figures, 7 tables

arXiv:1912.08852 [pdf, other]

Surface HOF: Surface Reconstruction from a Single Image Using Higher Order Function Networks

Authors: Ziyun Wang, Volkan Isler, Daniel D. Lee

Abstract: We address the problem of generating a high-resolution surface reconstruction from a single image. Our approach is to learn a Higher Order Function (HOF) which takes an image of an object as input and generates a mapping function. The mapping function takes samples from a canonical domain (e.g. the unit sphere) and maps each sample to a local tangent plane on the 3D reconstruction of the object. Each tangent plane is represented as an origin point and a normal vector at that point. By efficiently learning a continuous mapping function, the surface can be generated at arbitrary resolution in contrast to other methods which generate fixed resolution outputs. We present the Surface HOF in which both the higher order function and the mapping function are represented as neural networks, and train the networks to generate reconstructions of PointNet objects. Experiments show that Surface HOF is more accurate and uses more efficient representations than other state of the art methods for surface reconstruction. Surface HOF is also easier to train: it requires minimal input pre-processing and output post-processing and generates surface representations that are more parameter efficient. Its accuracy and convenience make Surface HOF an appealing method for single image reconstruction.

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1910.05766 [pdf, other]

QoS and Jamming-Aware Wireless Networking Using Deep Reinforcement Learning

Authors: Nof Abuzainab, Tugba Erpek, Kemal Davaslioglu, Yalin E. Sagduyu, Yi Shi, Sharon J. Mackey, Mitesh Patel, Frank Panettieri, Muhammad A. Qureshi, Volkan Isler, Aylin Yener

Abstract: The problem of quality of service (QoS) and jamming-aware communications is considered in an adversarial wireless network subject to external eavesdropping and jamming attacks. To ensure robust communication against jamming, an interference-aware routing protocol is developed that allows nodes to avoid communication holes created by jamming attacks. Then, a distributed cooperation framework, based on deep reinforcement learning, is proposed that allows nodes to assess network conditions and make deep learning-driven, distributed, and real-time decisions on whether to participate in data communications, defend the network against jamming and eavesdropping attacks, or jam other transmissions. The objective is to maximize the network performance that incorporates throughput, energy efficiency, delay, and security metrics. Simulation results show that the proposed jamming-aware routing approach is robust against jamming and when throughput is prioritized, the proposed deep reinforcement learning approach can achieve significant (measured as three-fold) increase in throughput, compared to a benchmark policy with fixed roles assigned to nodes.

Submitted 13 October, 2019; originally announced October 2019.

arXiv:1910.02066 [pdf, other]

Higher Order Function Networks for View Planning and Multi-View Reconstruction

Authors: Selim Engin, Eric Mitchell, Daewon Lee, Volkan Isler, Daniel D. Lee

Abstract: We consider the problem of planning views for a robot to acquire images of an object for visual inspection and reconstruction. In contrast to offline methods which require a 3D model of the object as input or online methods which rely on only local measurements, our method uses a neural network which encodes shape information for a large number of objects. We build on recent deep learning methods capable of generating a complete 3D reconstruction of an object from a single image. Specifically, in this work, we extend a recent method which uses Higher Order Functions (HOF) to represent the shape of the object. We present a new generalization of this method to incorporate multiple images as input and establish a connection between visibility and reconstruction quality. This relationship forms the foundation of our view planning method where we compute viewpoints to visually cover the output of the multi-view HOF network with as few images as possible. Experiments indicate that our method provides a good compromise between online and offline methods: Similar to online methods, our method does not require the true object model as input. In terms of number of views, it is much more efficient. In most cases, its performance is comparable to the optimal offline case even on object classes the network has not been trained on.

Submitted 4 October, 2019; originally announced October 2019.

Comments: 7 pages, 6 figures

arXiv:1909.06441 [pdf, other]

doi 10.1109/LRA.2020.2965061

MinneApple: A Benchmark Dataset for Apple Detection and Segmentation

Authors: Nicolai Häni, Pravakar Roy, Volkan Isler

Abstract: In this work, we present a new dataset to advance the state-of-the-art in fruit detection, segmentation, and counting in orchard environments. While there has been significant recent interest in solving these problems, the lack of a unified dataset has made it difficult to compare results. We hope to enable direct comparisons by providing a large variety of high-resolution images acquired in orchards, together with human annotations of the fruit on trees. The fruits are labeled using polygonal masks for each object instance to aid in precise object detection, localization, and segmentation. Additionally, we provide data for patch-based counting of clustered fruits. Our dataset contains over 41,000 annotated object instances in 1000 images. We present a detailed overview of the dataset together with baseline performance analysis for bounding box detection, segmentation, and fruit counting as well as representative results for yield estimation. We make this dataset publicly available and host a CodaLab challenge to encourage comparison of results on a common dataset. To download the data and learn more about MinneApple please see the project website: http://rsn.cs.umn.edu/index.php/MinneApple. Up to date information is available online.

Submitted 3 January, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

arXiv:1908.00914 [pdf, other]

doi 10.1109/ICRA.2019.8794031

Asynchronous Network Formation in Unknown Unbounded Environments

Authors: Selim Engin, Volkan Isler

Abstract: In this paper, we study the Online Network Formation Problem (ONFP) for a mobile multi-robot system. Consider a group of robots with a bounded communication range operating in a large open area. One of the robots has a piece of information which has to be propagated to all other robots. What strategy should the robots pursue to disseminate the information to the rest of the robots as quickly as possible? The initial locations of the robots are unknown to each other, therefore the problem must be solved in an online fashion. For this problem, we present an algorithm whose competitive ratio is $O(H \cdot \max\{M,\sqrt{M H}\})$ for arbitrary robot deployments, where $M$ is the largest edge length in the Euclidean minimum spanning tree on the initial robot configuration and $H$ is the height of the tree. We also study the case when the robot initial positions are chosen uniformly at random and improve the ratio to $O(M)$. Finally, we present simulation results to validate the performance in larger scales and demonstrate our algorithm using three robots in a field experiment.

Submitted 2 August, 2019; originally announced August 2019.

arXiv:1907.10388 [pdf, other]

Higher-Order Function Networks for Learning Composable 3D Object Representations

Authors: Eric Mitchell, Selim Engin, Volkan Isler, Daniel D Lee

Abstract: We present a new approach to 3D object representation where a neural network encodes the geometry of an object directly into the weights and biases of a second 'mapping' network. This mapping network can be used to reconstruct an object by applying its encoded transformation to points randomly sampled from a simple geometric space, such as the unit sphere. We study the effectiveness of our method through various experiments on subsets of the ShapeNet dataset. We find that the proposed approach can reconstruct encoded objects with accuracy equal to or exceeding state-of-the-art methods with orders of magnitude fewer parameters. Our smallest mapping network has only about 7000 parameters and shows reconstruction quality on par with state-of-the-art object decoder architectures with millions of parameters. Further experiments on feature mixing through the composition of learned functions show that the encoding captures a meaningful subspace of objects.

Submitted 6 April, 2020; v1 submitted 24 July, 2019; originally announced July 2019.

Comments: To be published in International Conference on Learning Representations (ICLR 2020) [https://openreview.net/forum?id=HJgfDREKDB]; 19 pages

arXiv:1907.06337 [pdf, other]

Energy-efficient Path Planning for Ground Robots by Combining Air and Ground Measurements

Authors: Minghan Wei, Volkan Isler

Abstract: As mobile robots find increasing use in outdoor applications, designing energy-efficient robot navigation algorithms is gaining importance. There are two primary approaches to energy efficient navigation: Offline approaches rely on a previously built energy map as input to a path planner. Obtaining energy maps for large environments is challenging. Alternatively, the robot can navigate in an online fashion and build the map as it navigates. Online navigation in unknown environments with only local information is still a challenging research problem. In this paper, we present a novel approach which addresses both of these challenges. Our approach starts with a segmented aerial image of the environment. We show that a coarse energy map can be built from the segmentation. However, the absolute energy value for a specific terrain type (e.g. grass) can vary across environments. Therefore, rather than using this energy map directly, we use it to build the covariance function for a Gaussian Process (GP) based representation of the environment. In the online phase, energy measurements collected during navigation are used for estimating energy profiles across the environment using GP regression. Coupled with an A*-like navigation algorithm, we show in simulations that our approach outperforms representative baseline approaches. We also present results from field experiments which demonstrate the practical applicability of our method.

Submitted 15 July, 2019; originally announced July 2019.

arXiv:1904.03260 [pdf, other]

Pixels to Plans: Learning Non-Prehensile Manipulation by Imitating a Planner

Authors: Tarik Tosun, Eric Mitchell, Ben Eisner, Jinwook Huh, Bhoram Lee, Daewon Lee, Volkan Isler, H. Sebastian Seung, Daniel Lee

Abstract: We present a novel method enabling robots to quickly learn to manipulate objects by leveraging a motion planner to generate "expert" training trajectories from a small amount of human-labeled data. In contrast to the traditional sense-plan-act cycle, we propose a deep learning architecture and training regimen called PtPNet that can estimate effective end-effector trajectories for manipulation directly from a single RGB-D image of an object. Additionally, we present a data collection and augmentation pipeline that enables the automatic generation of large numbers (millions) of training image and trajectory examples with almost no human labeling effort. We demonstrate our approach in a non-prehensile tool-based manipulation task, specifically picking up shoes with a hook. In hardware experiments, PtPNet generates motion plans (open-loop trajectories) that reliably (89% success over 189 trials) pick up four very different shoes from a range of positions and orientations, and reliably picks up a shoe it has never seen before. Compared with a traditional sense-plan-act paradigm, our system has the advantages of operating on sparse information (single RGB-D frame), producing high-quality trajectories much faster than the "expert" planner (300ms versus several seconds), and generalizing effectively to previously unseen shoes.

Submitted 5 April, 2019; originally announced April 2019.

Comments: 8 pages

arXiv:1904.02203 [pdf, other]

Semantics-Aware Image to Image Translation and Domain Transfer

Authors: Pravakar Roy, Nicolai Häni, Jun-Jee Chao, Volkan Isler

Abstract: Image to image translation is the problem of transferring an image from a source domain to a different (but related) target domain. We present a new unsupervised image to image translation technique that leverages the underlying semantic information for object transfiguration and domain transfer tasks. Specifically, we present a generative adversarial learning approach that jointly translates images and labels from a source domain to a target domain. Our main technical contribution is an encoder-decoder based network architecture that jointly encodes the image and its underlying semantics and translates both individually to the target domain. Additionally, we propose object transfiguration and cross-domain semantic consistency losses that preserve semantic labels. Through extensive experimental evaluation, we demonstrate the effectiveness of our approach as compared to the state-of-the-art methods on unsupervised image-to-image translation, domain adaptation, and object transfiguration.

Submitted 1 March, 2021; v1 submitted 3 April, 2019; originally announced April 2019.

Showing 1–50 of 62 results for author: Isler, V