iPolicy: Incremental Policy Algorithms for Feedback Motion Planning
Abstract
This paper presents policy-based motion planning for robotic systems. The motion planning literature has been mostly focused on open-loop trajectory planning which is followed by tracking online. In contrast, we solve the problem of path planning and controller synthesis simultaneously by solving the related feedback control problem. We present a novel incremental policy (iPolicy) algorithm for motion planning, which integrates sampling-based methods and set-valued optimal control methods to compute feedback controllers for the robotic system. In particular, we use sampling to incrementally construct the state space of the system. Asynchronous value iterations are performed on the sampled state space to synthesize the incremental policy feedback controller. We show the convergence of the estimates to the optimal value function in continuous state space. Numerical results with various different dynamical systems (including nonholonomic systems) verify the optimality and effectiveness of iPolicy.
I Introduction
Informally speaking, given a robot with a description of its dynamics, a description of its environment and a set of goal states, the motion planning problem is to find a sequence of control inputs so as to guide the robot from the initial state to one of the goal states while avoiding collision in the cluttered environment. It is well-known that robotic motion planning is at least as difficult as the generalized piano mover’s problem, which has been proven to be PSPACE-hard [1]. Many planning algorithms have been proposed. The discrete approaches, such as Dijkstra’s Algorithm and A* [2], usually depend on the structured environment and apply graph search to find the shortest path. The optimality of their returned trajectories are formally guaranteed for discrete problems but their heavy dependency on structured environment makes them suffer from inaccuracies and infeasibility of the solution when dealing with continuous systems [3]. Recently, sampling-based geometric planning algorithms such as the rapidly-exploring random trees (RRT) [4] and its optimal variant RRT* [5] are arguably the most influential and widely-used motion planning algorithms since the last two decades. They are shown to compute quickly in high-dimensional continuous space and possess theoretical guarantees such as probabilistic completeness and optimality.
Robots, in most of the practical applications, have stringent differential constraints on their motion which need to be properly incorporated during motion planning. Motion planning for dynamical systems has a rich history and much work has been done on this topic [6, 7, 2, 8]. However, the problem of motion planning for nonholonomic systems is still open in many aspects [9]. Sampling-based algorithms have received a lot of attention for their efficiency in handling obstacle-cluttered environments. There are two main bodies of research to make these methods more suitable for dealing with differential constraints: one develops steering functions for dynamical systems using concepts from feedback control theory (see e.g., [10, 11]); the other direction of work improves the computational effectiveness of the algorithms using geometric planning (see e.g., [9, 12, 13]). Despite the tremendous body of work on this topic, most of the work focuses on planning open-loop trajectories.
In practical implementation, the planned open-loop trajectories are tracked by a feedback controller. [14]. The presence of differential, state and input constraints, however, makes the design of the feedback tracking controller for highly nonlinear and complex robots still a difficult problem. Furthermore, the performance guarantees of the closed-loop trajectories could be lost in the presence of these constraints, especially in terms of minimizing the aggregated cost. It necessitates feedback motion planning which explicitly takes into account the feedback tracking during the planning process. The feedback planner computes the map** from state space to control space subject to constraints and equivalently searches for the solution for every initial condition of the robot; this facilitates the completeness and performance guarantee of the planner. However, the feedback motion planning problem is challenging in various aspects and most of the computational issues involving feedback planning are still unexplored and open. Interested readers are referred to [3] for a comprehensive survey of the challenges in feedback motion planning. Most of the previous approaches for feedback motion planning struggle with computational tractability in continuous state-space or local minima issues [2]. In this paper, we present a method for feedback motion planning of dynamical systems which makes use of sampling-based methods and value iterations to recover the optimal feedback controller. As a synthesis of the computational advantages of sampling-based algorithms and the approximation consistency of set-valued algorithms [15], the proposed iPolicy algorithm is an incremental and anytime feedback motion planner with formal guarantee of asymptotic optimality.
Contributions. This paper presents a sampling-based algorithm for feedback motion planning for a class of nonlinear robotic systems. In this work, we leverage the existing set-valued analysis tool for motion planning and asynchronous value iterations to synthesize feedback motion planners in an incremental fashion. In particular, the continual sampling creates an approximation of the original minimal time problem in every iteration and the policy on the discretized space provides an incrementally refined estimate of the value function. Using value iterations in an asynchronous and incremental manner limits the computational requirements while retaining optimality conditions. We show the convergence of the estimated value functions to the optimal value function for the motion planning problem in the continuous state space. For clarification of presentation, the various salient features of the algorithms are demonstrated with simulations using point mass system. Some numerical simulations are then provided for feedback motion planning of simple car and Dubins car (these are well studied nonholonomic systems [2, 8]) using the proposed motion planning algorithm. Several simulations are provided to show the value functions calculated by the proposed algorithm and trajectories under the guidance of the computed controller in cluttered environments. Through simulation study of various different dynamical systems, we show the applicability of the proposed algorithms across a wide range of robotic systems (two different classic nonholonomic systems).
Organization. This paper is organized in eight sections including the current one. We present work related to our proposed problem in Section II. Commonly sued notations and notions are clarified in Section III. In Section IV, we present a formal statement of the problem solved in this paper. The main algorithms of the paper are presented in Section V followed by analysis of the same in Section VI. Numerical results for the proposed algorithm are presented in Section VII with related interpretation and discussion. The paper is finally concluded with a summary and future work in Section VIII.
II Related Work
Motion planning is central to robotics. Consequently, it has received a lot of attention in robotics and controls community. Broadly speaking, there are mostly three kinds of approaches for motion planning – optimization-based methods, gradient-based methods and sampling-based methods.
Some of the most popular optimization-based approaches could be found in [16, 17, 18, 19, 20]. The main idea of these approaches is to formulate a dynamic optimization problem with smooth formulation of constraints (like collision, etc.). The resulting optimization problem can then be solved to generate an optimal trajectory. However, the main shortcoming of these optimization-based techniques is that they struggle to find solution in obstacle cluttered, non-convex spaces. Trajectory optimization techniques using optimal control literature can generate control trajectories in presence of state and input constraints for nonlinear dynamical systems – however, they may not be able to find feasible trajectories in the presence of arbitrary obstacles [21, 22]. These methods are mostly used to compute trajectories under dynamical constraints which can be followed online using a trajectory tracking controller using desirable constraints for execution [23, 24, 25].
Gradient-based approaches, such as potential fields [26, 27] and navigation functions [28], consider the composition of a repulsive component for collision avoidance and an attractive component for reaching goal, and derive the control law based on the gradient of the synthesized field. The computations of both fields usually only depend on local information and this brings computational efficiency to gradient-based approaches; however, the cancellation of different components in the composition can also trap the robot at local minima [2]. Thus, these approaches tend to struggle when a robot has to operate in obstacle cluttered environment.
Sampling-based approaches are shown to be able to successfully address irregularly shaped environments and arguably the most widely used methods for motion planning in robotics [2, 29]. However, most of these methods are used to compute open-loop plans for the robot to follow. It is, in general, desirable that robots operate in a feedback fashion using state estimates during execution of a task. Consequently, there has been some work on using sampling-based algorithms to develop feedback planners by using different metrics to select the best path. Some methods to solve for feedback planners are presented in [30, 31, 10, 32, 33]. However, very little work has been done on using these algorithms to incrementally build the state-space of systems to solve for continuous time and space optimal control problems for the dynamical systems. So, even though the motion planning problem has been thoroughly studied in literature, there seems to be lack of techniques and algorithms which can compute feedback motion planners for dynamical systems. We make an attempt to address this problem in this paper.
Our approach is closely related to the value iterations-based approach in [34]; however, we use asynchronous value iterations, provide results for nonholonomic systems and also provide rigorous numerical simulation results. Our results on optimal performance do not require the explicit description of steering functions between any two states of the system in the collision-free space. As such, we expect our results to cater to a rich set of robotic motion planning problems. Our approach is also related to [35] but differs in the way to solve the approximate motion planning problem, where [35] spends heavy computations searching for the optimal solution over the entire graph before refinement while our approach swiftly solves for a small subset of the graph and proceeds to finer graphs. This brings the continuously increasing optimality to our approach and it incrementally improves solutions once more computational resources are given. The problem and algorithms presented in this paper are different from some other sampling-based feedback planning like [36], and the proposed algorithms have the advantage of being incremental (instead of batch). Furthermore, optimal performance guarantees for the feedback control problem have been provided.
III Notations and notions
Let be the -norm in . Define the distance from a point to a set by . Given a compact set and a function , denote the supremum norm over by . Let the Lebesgue measure of the set be . Denote the unit ball in by and the volume of the unit ball by . When no ambiguity is caused, the dimensionality in the subscript may be omitted. Define the Minkowski sum in for and by . When is a singleton, with slight abuse of notations, the Minkowski sum can be written as . Define the value assignment operator by where the value of is assigned to .
IV Problem Formulation
Consider a robot associated with a time-invariant dynamic system governed by the following differential equation:
(1) |
where is the state and is the control of robot. The following mild assumptions are imposed throughout the paper and they are inherited from [15].
Assumption IV.1.
The following assumptions hold for (1):
-
(A1)
and are non-empty and compact;
-
(A2)
is continuous in and Lipschitz continuous in for any with a Lipschitz constant ;
-
(A3)
is linear growth; i.e., s.t. ;
-
(A4)
is a convex set for any .
Let and be the obstacle region and the goal region, respectively. Define the obstacle free region as . Denote the trajectory of system (1) as given the initial state and a state feedback controller . Let be the first time when the trajectory hits while staying in before ; i.e.,
The problem of interest in this paper is to find the control policy for system (1) that incurs the minimal traveling time for every ; that is,
and the minimizer is referred to as the optimal policy.
Notice that the minimal traveling time function is not necessarily bounded, as may not be reachable for every , and would be positive infinity for such states. The positive infinity can cause difficulties in both computational expressions and mathematical analysis, and we address this issue by leveraging the Kruzhkov transform , defined as
and we let the transformed value function be denoted by inheriting all subscripts and superscripts of . Notice that the Kruzhkov transform is bijective and monotonically increasing. The positive infity is transformed to after the Kruzhkov transform while remains the same. The objective of this paper is equivalently to find the transformed minimal traveling time function and the optimal policy .
V The Incremental Policy (iPolicy) Algorithm
In this section, we present the incremental policy algorithm iPolicy. We leverage sampling-based methods in [5, 12, 37] and set-valued methods in [15, 35] along with asynchronous value iterations to incrementally approximate the minimal traveling time function. In particular, iPolicy consists of two components: graph expansion and minimal traveling time estimation. In graph expansion, iPolicy continually samples the free region and constructs a graph using the set-valued method to approximate the continuous time dynamic of system (1); in minimal traveling time estimation, asynchronous value iterations are executed over the approximate graph in a back propogation fashion, where only stale values are updated to save computations, to obtain an estimate of the minimal traveling time function. As more samples are added, iPolicy iteratively expands graph and estimates the minimal traveling time function until a certain limit (e.g., time limit, approximation error threshold) is reached. Denote the set of sampled states from after iterations of iPolicy by . With slight abuse of notations, denote the estimate of minimal traveling time, also known as the value function, after iterations by . The staleness function is defined as the number of iterations since the last update of a sample . We proceed to explain iPolicy in the rest of this section.
V-A Algorithm statement
We initialize iPolicy with a set of states , which we assume it intersects with the goal set to ensure the availability of boundary condition in value iteration execution. Each sampled state is initialized with its value set by and the its staleness by a staleness threshold , meaning their values shall be updated at the beginning of the main loop of iPolicy. See lines 1-1 in Algorithm 1.
Then iPolicy enters the main loop in line 1. Similar to exploration, the main loop samples and add newly sampled state to as line 1. Then the value and staleness of are initialized in the same way as those in the initialization stage, while for sampled states from the last grid , we retain their staleness and values as and . Since the new sample may incur paths with lower cost, value iterations shall be executed on as line 1 in Algorithm 1 and Algorithm 2 to obtain a better estimate of the minimal traveling time . The conventional value iteration is executed in a synchronous fashion, where the value at every sampled state on is updated in every iteration, and its computations grow quickly as more and more samples are added to . We employ the asynchronous value iteration to mitigate the computational complexity. Particularly, the value at is updated when two circumstances occur: first, a sampled state is newly added to , since its initial value may not be accurate; second, the value of a sampled state is too stale to reflect the true estimate of the minimal traveling time. Recall the staleness function represents the number of iterations since the last update of a sample . Let be the staleness function before the update at iteration . Then and for , we have . The staleness threshold is the maximum number of value iterations a sampled state can skip. When , the asynchronous update is identical to the synchronous update, while when is positive infinity, each sample will only be updated once. In Algorithm 2 ValueItertion, the set of stale samples are updated using Algorithm 3 BackProp that recursively searches for the minimum values from one’s neighbors. After updates, the staleness for the samples in will be cleared as line 2 in Algorithm 2 as they have been freshly updated while staleness of other samples would be increased by as line 2.
Algorithm 3 BackProp leverages the Bellman equation to recursively update values and partially solve for the estimate of minimal traveling time on . For simplicity, we drop the dependency on in the psuedo codes of BackProp. The vanilla update procedure of value iteration is summarized as the Bellman operator :
(2) |
where
(3) | ||||
is the set of one-hop neighbors of using Euler discretiztion and is the perturbation radius inherited from [15]. See Figure 1 for the visualization of the discretization, where the robot transits from to after applying a constant control for a fixed time , and we treat all sampled states (orange dots) in the circle of radius centered at as the one-hop neighbors of , as is not necessarily in . With the Kruzhkov transform, one may rewrite the update procedure (2) and define a tranformed Bellman operator as:
where is the transformed running cost, is the total discount factor in the transformed form and is the transformed Bellman operator. For each , BackProp computes in a recursive manner. See Figure 1 again for the visualization of BackProp. Algorithm 3 BackProp first checks whether the goal region is reached or the recursion allowance runs out as line 3, which are the boundary condition of Bellman equation and the limit of computations for each iteration, respectively. If neither is satisfied, BackProp will check and compute for each neighbor by calling itself as line 3. This will temporarily halt the exeuction of current BackProp (recursion allowance ) with all values saved, while processor will create a duplicate of BackProp with a reduced recursion allowance and continues executing the new BackProp When the new BackProp finishes according to the boundary conditions in line 3, the processor will exit it and resume the execution of the old BackProp with recursion allowance with its previously saved values and the returned value at a one-hop neighbor from the new BackProp. With all values at one-hop neighbors being ready, BackProp computes using the Bellman operator as line 3. In other words, a recursion with maximum allowance starting from updates the value at by propogating values at neighbors of within hops. This limits the search to a subset of to save computations. When , BackProp updates all sampled states on in the depth-first search manner.
The consistency of iPolicy follows the discretization scheme in [15], where convergence rate conditions are imposed for the temporal and spatial resolutions. Specifically, let be a conservative estimate of the spatial dispersion, be the associated temporal resolution and be the perturbation radius. Then is a distance within which every point can find a sampled state in and would be the minimal time elapse system (1) can transit. Since system (1) may not fall on a sample in if it takes a constant control for time units, the discrepancy of approximation of the dynamic system is accommodated by the perturbation so that the one-hop neighbors of state with pertubation would surely fall on ; i.e., . The choice of and is not unique but is subject to the following assumption for the purpose of consistent approximation.
Assumption V.1.
The selection of and satsifies the following relations:
-
(A5)
, where ;
-
(A6)
and monotonically decreases to as ;
-
(A7)
;
-
(A8)
.
Assumption (A5) follows from [5] and implies the lower bound of the spatial resolution that the union of balls centered at every sample with radius covers the whole region . Assumption (A6) characterizes the convergence rate for temporal and spatial resolutions. Notice that the perturbation radius is dimishing faster than ; this ensures the discrete time transition always falls on the graph for sufficiently small temporal-spatial resolutions. In Assumption (A6), the faster convergence of compared to indicates the discrete time transition on would be close to the discrete time transition on for sufficiently small ; and finalizes the consistency of approximation. Assumptions (A7) and (A8) are purely for the validity of computations and analysis of iPolicy. Assumption (A7) trivially holds for sufficiently large by Assumption (A6) while assumption (A8) holds when takes the lower bound in Assumption (A5).
V-B Performance guarantee
The consistency of the estimates requires proper selection of temporal resolution and sufficient value iterations executed on . The following assumption is adapted from Assumption IV.1 in [35] and it imposes the minimum required value iterations executed on each graph with convergence guarantee.
Assumption V.2.
There exists a graph index , an upper bound of intervals and an upper bound of accumulated discount such that , the total numbers of value iterations on each graph satisfy
where the graph index is denoted by .
In particular, Assumption V.2 partitions the graph indices into intervals of equal length , and the aggregation of the worst-case discount on each interval over consecutive intervals should be strictly lower than . This implies the distance between the estimated value function and the fixed point is strictly decreasing in a window of graphs.
The main result is summarized below, where the last estimate of minimal traveling time converges to the ground truth pointwise with probability one.
VI Analysis
In this section, we analyze the performance of iPolicy. Towards the proof of Theorem V.1, we proceed with the following result with a supplementary definition.
Theorem VI.1.
Remark VI.1.
Theorem VI.1 extends Theorem V.1 to the whole state space by manually setting as for every , indicating there is no feasible path connecting the goal set . This practice consolidates the obstacle regions where is only defined with the free regions and simplifies analysis by reducing the total number of cases considered in the analysis.
The following lemma shows the supplementary definition in Theorem VI.1 is not involved in any part in the execution of the proposed algorithms.
Lemma VI.1.
For any and , it holds that and , and .
Proof.
Lemma VI.1 shows that the supplementary definition of and over will not affect computations of and over the free region, and Theorem VI.1 is identical to Theorem V.1 over . Throughout the analysis in this section, we assume all conditions in Theorem VI.1 hold.
VI-A Preliminaries
In this subsection, we introduce a couple of preliminary results that facilicate the proof of our main theorem. The following lemma is borrowed from Lemma VII.4 in [35] and is listed below for the completeness of the paper.
Lemma VI.2 (Lemma VII.4 in [35]).
Given , consider and s.t. , where . Then for any set-valued map s.t. , we have
(4) | ||||
If and , then
(5) |
The following lemma shows that the maximum over a finite horizon of a convergent sequence is also convergent.
Lemma VI.3.
Given a sequence s.t. , consider a new sequence s.t. , where is finite set of integers. Then .
Proof.
Since , then , s.t. , . Then it holds that . This implies . ∎
The following lemma from Lemma 7.2 in [38] characterizes the upper bound of the spatial resolution .
Lemma VI.4.
Consider an estimate of spatial dispersion . Then it holds that
Remark VI.2.
The following lemma shows the whole graph will be periodically updated within an interval of length .
Lemma VI.5.
It holds that .
Proof.
Fix and three cases arise:
-
•
Case 1: . Then the lemma trivially holds;
-
•
Case 2: . This implies . Then the staleness of increases to at iteration . With that being said, and this completes the proof in Case 2;
-
•
Case 3: . Then is a newly added node at some iteration and . This completes the proof in Case 3.
Since the above holds , proof is completed. ∎
The following lemma characterizes the values of nodes within goal regions.
Lemma VI.6.
Given and any , it holds that .
Proof.
Following the definition of the Bellman operator and the initial values of , one sees that . The same holds for . This completes the proof. ∎
VI-B Contraction property on a fixed graph
In this subsection, we analyze the monotonicity of the Bellman operator on a fixed graph, which is summarized in the following lemma.
Lemma VI.7.
The distance between a value function and the fixed point is monotonically decreasing over the Bellman operator ; i.e.,
(6) |
Specially, strict monotonic decrease holds ; that is,
(7) |
Moreover, the perturbed version of the above special case holds with probablity one in the asymptotic sense:
(8) | ||||
where is a set of sufficiently small measure.
Proof.
We first proceed to show the proofs of (7) and (8). Once (7) and (8) are proven, we extend the results to (6).
Claim VI.1.
Inequality (7) holds for every .
Proof.
Claim VI.2.
Inequality (8) holds with probability one in the asymptotic sense.
Proof.
Following the argument towards Claim VI.1, Case 1 trivially holds for Claim VI.2. We rewrite (9) for and arrive at
(10) | ||||
It follows from the supplementary definition in Theorem VI.1 that . Then it follow from the definition of in (3) that
Same holds for . Then by Lemma VI.4 and (4) in Lemma VI.2, (10) renders with probability one at
(11) | ||||
Since , we apply (4) again and (11) renders at
(12) | ||||
Since is a set of sufficiently small measure, a new node falls into with sufficiently small probability; i.e., . With that being said, it follows again from (4) and (12) that the following holds with probability one:
Then it follows from (5), the above inequality renders at
Combining all above relations completes the proof. ∎
Claim VI.3.
Inequality (6) holds true.
Proof.
VI-C Asynchronous contraction over graphs
Following Algorithm 3, we recursively define the set of updated nodes by , and . Correspondingly, define the value function by and , and . Starting from this subsection, we consider two perturbed estimates of the minimal traveling time function and . Notice that the perturbation radii and below the minimization operator are different. Similar notations apply to and respectively for . The following lemma characterizes the contraction property between two consecutive graphs.
Lemma VI.8.
Consider . It holds with probability one:
where and is sufficiently small.
Proof.
Fix . By line 2 in Algorithm 2, we apply (7) in Lemma VI.7 for times and it renders at
(13) |
Then it again follows from (6), (8) in Lemma VI.7 and Assumption (A8) that
(14) | ||||
The last inequality follows from (4) in Lemma VI.2. Let and two cases arise. We proceed to show both cases render at the same result, summarized in (20).
Case 1, . It follows from Lemma VI.6 that ; that is, . Then we have
(15) | ||||
We focus on the first term. For , since no changes are made to values at , we have
(16) | ||||
Combining (15) and (16) renders at
(17) | ||||
Case 2, . Then by the triangular inequality of supremum norm, one has a result similar to (15):
(18) |
We focus on the first term . For any , it renders at the identical result as (16). For any , since , it follows line 1 in Algorithm 1 that . Then we have
(19) |
Combining (18) and (19) renders at
(20) |
which is identical to (17). This verifies that the two cases on render at the same result.
The following corollary characterizes the upper bound of the approximation errors on using the error on .
Corollary VI.1.
The following inequality holds for any with probability one:
where and is sufficiently small.
Proof.
Fix . When , it follows from Lemma VI.8 that . When , the worst case is that no updates are imposed on ; in such case, we have . Therefore, . This completes the proof. ∎
The following corollary characterizes the contraction property over consecutive graphs.
Corollary VI.2.
Consider for and . The following holds with probability one:
where and is sufficiently small.
VI-D Proof of Theorem VI.1
In this subsection, we present the proof of Theorem VI.1.
Proof of Theorem VI.1.
Let and consider an interval . We start with . It follows from the triangular inequality of the maximum norm that
(22) | ||||
where and and the last inequality follows from (5) in Lemma VI.2.
We fix and let be the last time when value at is updated. Notice that is a function of , but for the conciseness of the proof, we omit the depnedency on in notation. It follows from Lemma VI.5 that and . Then the following holds:
(23) | ||||
where the inequality is a direct result of the triangular inequality of the maximum norm.
We first analyze the first term in (23). Since , it follows from Lemma VI.8 that
Then we apply Corollary VI.2 and it arrives at
(24) | ||||
where is a function of , as inside the definition is a function of .
We then analyze the second term. It follows from the triangular inequality that the second term renders at ,
(25) |
where and otherwise. Combining (24) and (25), we have
(26) | ||||
Since (26) holds for every , taking the maximum over renders at
(27) | ||||
where . Now we would like to show the maximum term in (27) converges to zero by Lemma VI.3.
Claim VI.4.
The following holds true:
Proof.
We first show converges to zero for every . For notational simplicity, we use instead of in this proof. Fix . Since , we have , a finite sum of . It follows from Corollary .2 that
Then we have
Since , this completes the proof of .
For brevity, we rewrite (27) as
(28) |
where and . Notice that as a result of Claim VI.4. By Assumption V.2, we continue to relax over intervals, each of which consists of graphs, and it holds for any that
(29) | ||||
where is a finite sum of . Since , converges to as . It follows from Lemma VII.5 in [35] that . It follows from Corollary .2 that . Together with (22), we have
This completes the proof. ∎
VII Numerical Results and Discussion
In this section, we first present a computation-saving adaptation of the proposed iPolicy algorithm for a class of simpler but widely used dynamic systems. Then we present results of numerical experiments for three differential systems - point mass, simple car and Dubins car. Using the point mass, we show the correctness of the theoretical results. We show several simulations for unicycle robots in obstacle-free as well as cluttered environment to perform complex maneuvers like automatic parking. The experiments presented in this section are done to investigate the following characteristics of the proposed iPolicy algorithm.
-
1.
Can iPolicy recover the optimal value function for systems with differential constraints?
-
2.
Can the state-based feedback controllers handle complex motion maneuvers in the presence of obstacles?
-
3.
Can we demonstrate the anytime and incremental nature of the algorithm? Is the algorithm able to monotonically improve the plan towards the goal?
In the rest of this section, we will try to answer the above questions with systems of increasing complexity.
VII-A Computation-saving query
In this subsection, we introduce a class of simpler but widely used dynamical systems in the robotics community and an adaptation for iPolicy that significantly reduces computations in practice but maintains the optimality in Theorem V.1. Consider a class of systems that, in addition to the satisfaction of Assumption IV.1, also satisfy the following assumption:
Assumption VII.1.
System (1) satisfies that for any s.t. .
In other words, Assumption VII.1 indicates the robot can instantly stop at any state. The following lemma shows the one-hop neighbors is monotonically decreasing in .
Proof.
It follows from Assumption V.1 that and are monotonically decreasing in . Recall the definition of in (3) follows . Fix and . We proceed to prove by showing is included in each part of
We first show . Since , there exist and s.t. . It follows from Assumptions (A4) and VII.1 that for any and , . By the decreasing monotonicity of in , this implies . It again follows from the decreasing monotonicity of that . In summary, we have .
Then we show . Since , this trivially holds.
Finally, we show . Since , there exist and that . It follows from the decreasing monotonicity of that . The proof of is completed.
The above arguments show that . Since this holds for every and , the proof is completed. ∎
Based on Lemma VII.1, we can construct in line 3 of Algorithm 3 BackProp in a computationally efficient manner, where the execution of the computationally heaviest part, querying within a range of given a random state , is reduced. Specifically, we treat as a set of samples independent of graph index ; i.e., . See Algorithm 4. When is added to as , we initialize as its definition (3) using the resolutions and at that time, and update the one-hop neighbors of every other sampled state as lines 4 to 4; when querying for the one-hop neighbors of in a later iteration, we update by removing faraway one-hop neighbors according to new resolutions as lines 4 to 4. In contrast to the vanilla query command as line 3 in BackProp, where searching the whole is executed every time when is queried, the new adaptation skips this step and saves computations. This adaptation is leveraged throughout the results in Section VII.
VII-B Point mass
In this subsection, we try to show the correctness of the algorithm and the rate of convergence through a number of numerical simulations for a point mass. The point mass dynamic follows a single integrator , where the velocity on each coordinate can be directly manipulated. The control set is a unit disc; i.e., the maximum velocity is restricted to . The environment is a square of size populated with a rectangular obstacle and a circular obstacle, and the robot desires to rest at a central circle. We used and for computational simplicity. The experiment used and ; i.e., every vertex was updated at-most after iterations and the recursion depth is a constant of . Notice that a constant recursion depth does not necessarily satisfy Assumption V.2, but the results still show its applicability in practice.
Figures 2a to 2c qualitatively show the estimated value function over the whole environment after the computations of seconds, seconds and seconds, respectively, where the goal region is marked by the red dash circle. It is observed that the estimate of minimal traveling time to the goal region is monotonically improving.
Quantitative characterization of the approximation error is available, as the minimal traveling time function can be analytically computed for a point mass in the presence of regularly shaped obstacles. The approximation error is computed between the estimated value function and the ground truth and is characterized by the root of mean squared error (RMSE), which follows as . Notice that the estimated value function can be positive infinity when value iterations are not sufficiently executed and we exclude them in the comparison. We compare iPolicy with the multi-grid method, the single-robot version of [35]. We use grids with spatial resolutions and compute the temporal resolutions in the same way as [35] for the multi-grid method. All other parameters share the same values with those of iPolicy. Figure 2d quantitatively visualizes RMSEs over computational times for independently executed experiments. It is shown that the approximation error of iPolicy monotonically and continuously decreases in general as more computational time is consumed, and iPolicy obtains lower approximation errors compared to the multi-grid method within the allowed computational resources.
VII-C Simple car
In this subsection, we first show the optimality results with the simple car model. The simple car model follows the unicycle dynamic, a well-known driftless affine system in the configuration space [2]. The unicycle dynamics is given below:
where is the steering angle and is a circle . For computational simplicity, we redefine by , where indicates and are identical, and by using the Carnot-Carathéodory distance. To avoid the bias over different coordinates in computing distances, we let the region of interest be a square of and let . The control set of the simple car is given by , and the goal set is a sub-Riemannian ball of radius centered at . Figure 3 shows the estimated value function after approximately k seconds of computations over samples on the plane, where the orientation of the car is roughly parallel to the orientation at goal () with a maximum difference of . The shape of the level sets around the goal depict that the minimal traveling time is higher in the transverse direction when compared to that along the longitudinal direction parallel to the orientation of the goal state, as the car has a non-zero turn radius and it cannot move in the transverse direction.
In another simulation, we demonstrate the capability of iPolicy of handling automated parking, including head-in parking and parallel parking, of the simple car in the cluttered environment and show anytime property of the algorithm by visualizing the resulting trajectory under the guidance of the computed controller. See Figure 4a and Figure 4b. In the first simulation, a simple car desires to park in a one-way head-in-only parking lot, and it starts from and needs to safely stop at . In the second simulation, we simulate the parallel parking of a simple car on a two-way street, where the car starts from and needs to park at without causing any collisions. In Figure 4a, iPolicy uses samples and runs seconds to compute a safe and feasible controller for head-in parking in Figure 4a. In Figure 4b, iPolicy uses samples and runs seconds to command the simple car to safely park in parallel. Both cases imply the quick feasibility of iPolicy. Together with Theorem V.1 and Figure 2d, where the increasing optimality is verified, the anytime property of iPolicy is demonstrated. That is, iPolicy can produce a feasible solution given a short computational time, and can continuously optimize its solution if more computational time is given.
VII-D Dubins car
In this subsection, we show Algorithm 1 can handle more complex dynamic models. In particular, we also show the estimate of the value functions obtained using iPolicy for the Dubins car whose dynamics also lies in the space and is described by the same set of equations as the simple car; however, it can only move forward. Different from the Dubins car in [39] that can instantly stop, the Dubins car can only move forward at a constant speed, making its control set a singleton . The Dubins car is another example of nonholonomic system but without a reverse gear or a brake [12, 2]. Moreover, the Dubins car does not satisfy Assumption VII.1; however, we still apply Algorithm 4 in implementation and later result will show this quick adaptation does not affect the The goal set in this case is the sub-Riemannian ball of radius around the point and (again the goal is the point where we reach first inside the ball). In Figure 5, the estimated cost-to-go on a plane are shown where the orientation of the car is parallel to the orientation at goal with a maximum difference of . The estimate of the value functions show the more complex level sets obtained for the Dubins car as it can not move backwards. Unlike the simple car, the level sets are not rotationally symmetric about the origin. Notice that area on the right-hand side of in Figure 5 is mostly empty; this indicates iPolicy cannot find a solution in this area, mostly because the Dubins car starting from this area cannot reach the goal region without exiting the region of interest.
Remark VII.1.
Applying iPolicy to larger size environment does not require any additional modifications. However, at the early stage of computations, smaller size of samples may not display the desired probabilistic properties and some minor revisions may be needed for quick feasibility. Specifically, we can revise s.t. the one-hop neighbor of a sample does not exit the region of interest; i.e., for most and .
VIII Conclusions and Future Work
In this paper we presented an algorithm for feedback motion planning of dynamical systems. Using results from set-valued control theory and sampling-based algorithms, we presented the iPolicy algorithm which can be used for feedback motion planning of robotic systems and guarantees asymptotic optimality of the value functions. Numerical experiments with point mass, a simple car model and the Dubins car model are presented where the time-to-go costs are recovered using iPolicy. We discussed an asynchronous update rule of the value functions for computational efficiency of the algorithms and proved that it retains asymptotic optimality guarantees.
It was found that majority of the time of the algorithm is used in local connection and collision checking using the sub-Riemannian metric. Computational time is a critical bottleneck for the class of algorithms presented here and needs further research. Our future work includes the research on speeding up the calculations. Using parallel machine learning algorithms which can assist in connecting the states sampled during state-space construction using sampling could potentially help in relaxing the compute times. Learning-based algorithms [40] are shown to be efficient in addressing high-dimensional problems, which has the potential to help reduce computations of planners and makes it more applicable in practice. Furthermore, use of iPolicy for sub-optimal, trajectory-centric control of underactuated systems [41, 42] could also be an interesting topic of future research. Extension to affine or complementarity dynamical systems could be another interesting direction of research [43, 44, 45].
References
- [1] J. Reif, “Complexity of the mover’s problem and generalizations,” in Proceedings of the 20th Annual IEEE Conference on Foundations of Computer Science, pp. 421–427, 1979.
- [2] S. M. LaValle, Planning algorithms. Cambridge university press, 2006.
- [3] S. LaValle, “Motion planning,” Robotics Automation Magazine, IEEE, vol. 18, pp. 79–89, March 2011.
- [4] S. M. LaValle and J. J. Kuffner, “Randomized kinodynamic planning,” The International Journal of Robotics Research, vol. 20, no. 5, pp. 378–400, 2001.
- [5] S. Karaman and E. Frazzoli, “Sampling-based algorithms for optimal motion planning,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 846–894, 2011.
- [6] R. M. Murray and S. S. Sastry, “Nonholonomic motion planning: Steering using sinusoids,” Automatic Control, IEEE Transactions on, vol. 38, no. 5, pp. 700–716, 1993.
- [7] J.-P. Laumond, S. Sekhavat, and F. Lamiraux, Guidelines in nonholonomic motion planning for mobile robots. Springer, 1998.
- [8] Y. Wang, D. K. Jha, and Y. Akemi, “A two-stage RRT path planner for automated parking,” in 2017 13th IEEE Conference on Automation Science and Engineering (CASE), pp. 496–502, IEEE, 2017.
- [9] E. Schmerling, L. Janson, and M. Pavone, “Optimal sampling-based motion planning under differential constraints: the driftless case,” in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 2368–2375, IEEE, 2015.
- [10] A. Perez, R. Platt, G. Konidaris, L. Kaelbling, and T. Lozano-Perez, “LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 2537–2542, IEEE, 2012.
- [11] D. J. Webb and J. van den Berg, “Kinodynamic RRT*: Asymptotically optimal motion planning for robots with linear dynamics,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5054–5061, IEEE, 2013.
- [12] S. Karaman and E. Frazzoli, “Sampling-based optimal motion planning for non-holonomic dynamical systems,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 5041–5047, IEEE, 2013.
- [13] K. Hauser, “Lazy collision checking in asymptotically-optimal motion planning,” in 2015 IEEE international conference on robotics and automation (ICRA), pp. 2951–2957, IEEE, 2015.
- [14] H. Peng, F. Li, J. Liu, and Z. Ju, “A symplectic instantaneous optimal control for robot trajectory tracking with differential-algebraic equation models,” IEEE Transactions on Industrial Electronics, vol. 67, no. 5, pp. 3819–3829, 2020.
- [15] P. Cardaliaguet, M. Quincampoix, and P. Saint-Pierre, “Set-valued numerical analysis for optimal control and differential games,” in Stochastic and differential games: theory and numerical methods, pp. 177–247, Birkhäuser Boston Boston, MA, 1999.
- [16] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal, “Stomp: Stochastic trajectory optimization for motion planning,” in 2011 IEEE International Conference on Robotics and Automation, pp. 4569–4574, 2011.
- [17] N. Ratliff, M. Zucker, J. A. Bagnell, and S. Srinivasa, “Chomp: Gradient optimization techniques for efficient motion planning,” in 2009 IEEE International Conference on Robotics and Automation, pp. 489–494, 2009.
- [18] M. Kelly, “An introduction to trajectory optimization: How to do your own direct collocation,” SIAM Review, vol. 59, no. 4, pp. 849–904, 2017.
- [19] J. Schulman, Y. Duan, J. Ho, A. Lee, I. Awwal, H. Bradlow, J. Pan, S. Patil, K. Goldberg, and P. Abbeel, “Motion planning with sequential convex optimization and convex collision checking,” The International Journal of Robotics Research, vol. 33, no. 9, pp. 1251–1270, 2014.
- [20] A. U. Raghunathan, D. K. Jha, and D. Romeres, “PYROBOCOP : Python-based robotic control & optimization package for manipulation and collision avoidance,” CoRR, vol. abs/2106.03220, 2021.
- [21] X. Wang, J. Liu, Y. Zhang, B. Shi, D. Jiang, and H. Peng, “A unified symplectic pseudospectral method for motion planning and tracking control of 3d underactuated overhead cranes,” International Journal of Robust and Nonlinear Control, vol. 29, no. 7, pp. 2236–2253, 2019.
- [22] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913, IEEE, 2012.
- [23] K. Zhang, Q. Sun, and Y. Shi, “Trajectory tracking control of autonomous ground vehicles using adaptive learning mpc,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 12, pp. 5554–5564, 2021.
- [24] J. De Schutter, M. Zanon, and M. Diehl, “Tunempc—a tool for economic tuning of tracking (n) mpc problems,” IEEE Control Systems Letters, vol. 4, no. 4, pp. 910–915, 2020.
- [25] A. Majumdar and R. Tedrake, “Funnel libraries for real-time robust feedback motion planning,” The International Journal of Robotics Research, vol. 36, no. 8, pp. 947–982, 2017.
- [26] O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots,” in Autonomous Robot Vehicles, pp. 396–404, Springer, 1986.
- [27] J. Borenstein and Y. Koren, “The vector field histogram - fast obstacle avoidance for mobile robots,” IEEE Journal of Robotics and Automation, vol. 7, no. 3, pp. 278–288, 1991.
- [28] D. E. Koditschek and E. Rimon, “Robot navigation functions on manifolds with boundary,” Advances in applied mathematics, vol. 11, no. 4, pp. 412–442, 1990.
- [29] A. Orthey, C. Chamzas, and L. E. Kavraki, “Sampling-based motion planning: A comparative review,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 7, no. 1, p. null, 2024.
- [30] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, “Lqr-trees: Feedback motion planning via sums-of-squares verification,” The International Journal of Robotics Research, vol. 29, no. 8, pp. 1038–1052, 2010.
- [31] R. Tedrake, “LQR-trees: Feedback motion planning on sparse randomized trees,” in Proceedings of Robotics: Science and Systems, (Seattle, USA), June 2009.
- [32] G. J. Maeda, S. P. Singh, and H. Durrant-Whyte, “A tuned approach to feedback motion planning with rrts under model uncertainty,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 2288–2294, IEEE, 2011.
- [33] V. A. Huynh, S. Karaman, and E. Frazzoli, “An incremental sampling-based algorithm for stochastic optimal control,” in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 2865–2872, IEEE, 2012.
- [34] D. K. Jha, M. Zhu, and A. Ray, “Game theoretic controller synthesis for multi-robot motion planning-part II: Policy-based algorithms,” IFAC-PapersOnLine, vol. 48, no. 22, pp. 168–173, 2015.
- [35] G. Zhao and M. Zhu, “Pareto optimal multirobot motion planning,” IEEE Transactions on Automatic Control, vol. 66, no. 9, pp. 3984–3999, 2020.
- [36] A.-A. Agha-Mohammadi, S. Chakravorty, and N. M. Amato, “Firm: Sampling-based feedback motion-planning under motion uncertainty and imperfect measurements,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 268–304, 2014.
- [37] E. Schmerling, L. Janson, and M. Pavone, “Optimal sampling-based motion planning under differential constraints: the driftless case,” arXiv preprint arXiv:1403.2483, 2014.
- [38] E. Mueller, M. Zhu, S. Karaman, and E. Frazzoli, “Anytime computation algorithms for approach-evasion differential games,” arXiv preprints, 2013. http://arxiv.longhoe.net/abs/1308.1174.
- [39] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems, pp. 1071–1079, 2014.
- [40] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” arXiv preprint arXiv:2303.04137, 2023.
- [41] A. Majumdar, A. A. Ahmadi, and R. Tedrake, “Control design along trajectories with sums of squares programming,” in 2013 IEEE International Conference on Robotics and Automation, pp. 4054–4061, IEEE, 2013.
- [42] P. Kolaric, D. K. Jha, A. U. Raghunathan, F. L. Lewis, M. Benosman, D. Romeres, and D. Nikovski, “Local policy optimization for trajectory-centric reinforcement learning,” International Conference on Robotics and Automation (ICRA), 2020. https://arxiv.longhoe.net/pdf/2001.08092.
- [43] Y. Shirai, D. K. Jha, and A. U. Raghunathan, “Covariance steering for uncertain contact-rich systems,” arXiv preprint arXiv:2303.13382, 2023.
- [44] Y. Shirai, D. K. Jha, A. U. Raghunathan, and D. Romeres, “Robust pivoting: Exploiting frictional stability using bilevel optimization,” in 2022 International Conference on Robotics and Automation (ICRA), pp. 992–998, 2022.
- [45] G. Yang, M. Cai, A. Ahmad, C. Belta, and R. Tron, “Efficient LQR-CBF-RRT*: Safe and optimal motion planning,” arXiv preprint arXiv:2304.00790, 2023.
- [46] H. L. Royden and P. Fitzpatrick, Real analysis. Prentice Hall, 4 ed., 2010.
Guoxiang Zhao is currently an Associate Professor in the School of Future Technology at Shanghai University, Shanghai, China. He received PhD in Electrical Engineering from Pennsylvania State University in May 2022. He received a M.S. degree in Mechanical Engineering from Purdue University, West Lafayette, IN, USA and a B.E. degree from Shanghai Jiao Tong University, Shanghai, China. He was a Software Engineer from 2022 to 2023 with TuSimple, Inc. in San Diego, CA, USA. His research interests are in the areas of motion planning and multi-robot systems. |
Devesh K. Jha is currently a Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA, USA. He received PhD in Mechanical Engineering from Penn State in Decemeber 2016. He received M.S. degrees in Mechanical Engineering and Mathematics also from Penn State. His research interests are in the areas of Machine Learning, Robotics and Deep Learning. He is a recipient of several best paper awards including the Kalman Best Paper Award 2019 from the American Society of Mechanical Engineers (ASME) Dynamic Systems and Control Division. He is a senior member of IEEE and an associate editor of IEEE Robotics and Automation Letters (RA-L). |
Yebin Wang (M’10-SM’16) received the B.Eng. degree in Mechatronics Engineering from Zhejiang University, Hangzhou, China, in 1997, M.Eng. degree in Control Theory & Control Engineering from Tsinghua University, Bei**g, China, in 2001, and Ph.D. in Electrical Engineering from the University of Alberta, Edmonton, Canada, in 2008. Dr. Wang has been with Mitsubishi Electric Research Laboratories in Cambridge, MA, USA, since 2009, and now is a Senior Principal Research Scientist. From 2001 to 2003 he was a Software Engineer, Project Manager, and Manager of R&D Dept. in industries, Bei**g, China. His research interests include nonlinear control and estimation, optimal control, adaptive systems and their applications including mechatronic systems. |
Minghui Zhu is an Associate Professor in the School of Electrical Engineering and Computer Science at the Pennsylvania State University. Prior to joining Penn State in 2013, he was a postdoctoral associate in the Laboratory for Information and Decision Systems at the Massachusetts Institute of Technology. He received Ph.D. in Engineering Science (Mechanical Engineering) from the University of California, San Diego in 2011. His research interests lie in distributed control and decision-making of multi-agent networks with applications in robotic networks, security and the smart grid. He is the co-author of the book ”Distributed optimization-based control of multi-agent networks in complex environments” (Springer, 2015). He is a recipient of the award of Outstanding Graduate Student of Mechanical and Aerospace Engineering at UCSD in 2011, the Dorothy Quiggle Career Development Professorship in Engineering at Penn State in 2013, the award of Outstanding Reviewer of Automatica in 2013 and 2014, and the National Science Foundation CAREER award in 2019. He is an associate editor of the IEEE Open Journal of Control Systems, the IET Cyber-systems and Robotics and the Conference Editorial Board of the IEEE Control Systems Society. |
This section contains preliminary results towards the complete proof of Theorem V.1. Most notations are borrowed from [15] but we make the following changes for the consistency throughout this paper. Let be a closed set and be a closed target set of following the definition in [15]. We rewrite the minimal time function , the estimated value function and the the spatial resolution in [15] as , and respectively. Notice that all superscripts and subscripts of are kept in . Denote the domain of by . The following theorem proves a slightly stronger result compared to Corollary 3.7 in [15], where the perturbation radius can be arbitrarily large.
Theorem .1.
Given the convergence in the epigraphic sense, i.e., , converges pointwise to , i.e., ,
where and .
Proof.
Define s.t. .
Claim .1.
The upper and lower bound of for follows respectively
(30) |
(31) |
Proof.
Following the definition of , we have
(32) | ||||
Notice that . We can simplify the conditions on as
It again follows from (32) that . Since searches for an infimum of , taking an infimum of would render at the same result but remove unattainable candidate minimizers. Thus, the following holds by letting :
(33) | ||||
where the last inequality is a direct result of expansion of on . This completes the proof of (31). ∎
We proceed to show that is bounded both from below and above by . The following claim shows is lower bounded by .
Claim .2.
It holds that for any ,
Proof.
The following claim shows the is upper bounded by the limit of .
Claim .3.
For any , it holds that
Proof.
Following the argument in the proof of Corollary 3.7 in [15], we have . By the definition of , we have , . In other words, there exists a sequence s.t. .
Fix and consider a sequence s.t. . It follows from Claim .2 that , where the last inequality follows from the definition of . Therefore, is bounded. Then it follows from the supplementary statement about the definition of on page 236 in [15] that . Since , we have
Let s.t. . Therefore, . Then we have . It follows from (31) in Claim .1 that
This completes the proof. ∎
We proceed to show the convergence of the value function estimate on each point . Different from Theorem .1, we remove the perturbation in the theorem statement and directly analyze the values . Towards this end, we first prove the following lemma quantifying the lowerbound of .
Lemma .1.
Given all conditions in Theorem .1, the following relationship holds:
Proof.
It follows from Proposition 2.21 in [15] that is a discrete viability domain for . Thus, it holds that . It follows from Theorem 3.2 and Lemma 3.6 that we can replace the discrete viability kernals with epigraphs of traveling time functions; i.e.,
Fix and there exists s.t. ; that is, . This implies . This completes the proof. ∎
The following corollary characterized the pointwise convergence of the value function estimate.
Corollary .1.
Given the conditions in Theorem .1 and for some . Then the following holds for every :
Proof.
Notice that Theorem .1 only characterizes the pointwise convergence over . On , the value of is not well defined and the convergence cannot be analyzed. The Kruzhkov transform can address this issue and let . Without proof, the theorem is stated below.
Lemma .2.
Given the convergence in the epigraphic sense, i.e., , converges pointwise to , i.e., ,
where and .
The following corollary shows the uniform convergence for the transformed value function estimates.
Corollary .2.
The transformed value function estimate converges uniformly to almost everywhere on ; i.e.,
where has sufficiently small measure, and .
Proof.
The proof leverages Egoroff’s theorem [46]. Since is compact, has finite measure. It follows from Theorem .1 that are measurable functions converging on pointwise to . Then converges uniformly to almost everywhere to by the Egoroff’s theorem, where has sufficiently small measure. This completes the proof of the first relation. ∎