Redundant driving mechanical arm path planning method based on DSAW offline reinforcement learning algorithm
Technical Field
The invention belongs to the field of three-dimensional path planning for mechanical arms, and particularly relates to a path planning method for a redundant driving mechanical arm based on a DSAW offline reinforcement learning algorithm.
Background
Path planning is a core task in the autonomous operation of a mechanical arm. Its aim is to find an optimal or feasible path from a starting point to an end point in the environment for the tail end of the mechanical arm; at the same time, a redundantly driven mechanical arm admits an infinite number of inverse-kinematic solutions for a given path. Various factors need to be considered in the path planning of a redundantly driven robot, such as avoiding obstacles, computing inverse solutions, minimizing path length or cost, and following specific path rules. Path planning is critical to improving the efficiency and safety of robots and automation systems.
With the development of technology, conventional path planning methods, while effective in certain scenarios, struggle in complex, dynamic or unknown environments. These environments require systems with greater adaptability and learning capability, able to make decisions based on environmental changes and historical experience. Reinforcement learning is well suited to such problems because of its high degree of autonomous learning. Therefore, combining path planning with reinforcement learning provides a new perspective and a new method for solving the navigation problem in complex environments.
However, in more complex and uncertain high-dimensional environments, reinforcement learning struggles to adapt to high-dimensional multi-objective tasks because of its low exploration efficiency: exploration is time-consuming, convergence is slow, and the curse of dimensionality easily degrades planning capability. Offline reinforcement learning, also called batch reinforcement learning, is a variant of reinforcement learning whose main idea is that the redundant driving mechanical arm learns from a fixed data set rather than by exploration.
The path planning of a mechanical arm differs from that of an unmanned aerial vehicle or a wheeled vehicle: in the path planning stage, accurate joint angles must also be calculated so that the tail end of the mechanical arm correctly reaches the designated position.
Disclosure of Invention
Aiming at the prior art, the invention provides a path planning method based on the DSAW (Dense State Advantage Weighting) offline reinforcement learning algorithm, which improves node traversal speed, reduces the number of iterations, shortens the output path length, and ensures optimal joint angle solutions on large-scale maps.
The invention discloses a redundant driving mechanical arm path planning method based on a DSAW offline reinforcement learning algorithm, which comprises the following steps of:
Step 1, performing early-stage preparation for reinforcement learning path planning of the redundant driving mechanical arm and initializing a global map, namely performing environment modeling of the moving scene of the redundant driving mechanical arm in a computer and establishing a three-dimensional coordinate system; all obstacles in the moving scene are simplified into cuboids and spheres by the surrounding method so as to facilitate subsequent obstacle collision detection and collision prevention, wherein the number of obstacles is k and the set of obstacles is {obji}; an obstacle simplified to a sphere is described by its sphere radius and sphere center coordinates, and an obstacle simplified to a cuboid is described by its minimum and maximum coordinates on each direction axis;
Step 2, constructing a reinforcement learning path planning environment for the redundant driving mechanical arm in the CoppeliaSim simulator, wherein the path planning environment comprises a reinforcement learning state space, a reinforcement learning action space and a reinforcement learning reward function to guide the redundant driving mechanical arm to learn a behavior strategy for reaching the target point;
Step 2.1, designing the reinforcement learning state space, namely designing the state space as state = {start, goal, current, distobstacle, distgoal, step} according to the actual requirements of path planning and the global map information, wherein start = (xinit, yinit, zinit) represents the starting position of the redundant driving mechanical arm in the path planning environment, goal = (xend, yend, zend) represents the end position, current = (xcurrent, ycurrent, zcurrent) represents the current node position of the redundant driving mechanical arm, distobstacle represents the distance between the current node and the obstacles, distgoal represents the distance between the current node and the end point, and step represents the number of steps taken;
Step 2.2, designing the reinforcement learning action space, which defines all possible actions of the redundant driving mechanical arm; the most critical quantities in finding the next point are the direction and the step size, so the action space is designed as action = {x, y, z, length}, wherein (x, y, z) represents a direction vector whose components range over (-1, 1), and length represents the step size of a single movement of the redundant driving mechanical arm;
Step 2.3, designing the reinforcement learning reward function, which comprises rewarding correct behaviors of the redundant driving mechanical arm and punishing undesirable behaviors, providing a mechanism that guides the redundant driving mechanical arm to avoid obstacles and gradually learn and optimize path planning in three-dimensional space;
Step 3, constructing an offline reinforcement learning data set D in the reinforcement learning path planning environment; the offline reinforcement learning data set comprises historical interaction records generated during training, data generated in the simulator, and a manually constructed path planning expert database, and necessary normalization preprocessing is performed on D to improve its usability;
Step 4, performing learning iterations of the offline reinforcement learning algorithm DSAW using the normalized and preprocessed data set D, and outputting the path nodes n0, n1, ..., nn for the movement of the tail end of the redundant driving mechanical arm;
The specific updating process of the offline reinforcement learning algorithm DSAW is as follows:
According to the reinforcement learning path planning environment, a dense state advantage weighting function w(s, s') is proposed, which allocates more weight to desirable strategies and less weight to worse strategies:
w(s, s') = 1[A(s, s') ≥ 0] = 1[(Qθ(s, s') − Vψ(s)) ≥ 0] = 1[Qθ(s, s') ≥ Vψ(s)]
wherein A(s, s') represents the state advantage function, Qθ(s, s') represents the value function of the current-time state and the next-time state, Vψ(s) represents the evaluation function of the current state, and 1[Qθ(s, s') ≥ Vψ(s)] indicates that w(s, s') is 1 when Qθ(s, s') is greater than or equal to Vψ(s) and 0 when Qθ(s, s') is less than Vψ(s);
According to the updating process of the dense state advantage weighting function w(s, s'), the offline reinforcement learning algorithm DSAW first updates the state evaluation function Vδ(s):
wherein 1(·) represents the indicator function, k represents an adjustment parameter, Vδ(s) represents the state evaluation function, and the loss term represents the loss function used when updating the state evaluation function; the weight distribution emphasizes the cases in which Qθ'(s, s') is greater than Vδ(s), so that when updating the state evaluation function the DSAW algorithm increases the weight of good states and reduces the weight of bad states; Qθ'(s, s') is produced by the target network θ' of the network parameters θ of the state value function over the current-time and next-time states, and the state evaluation network Vδ(s) is parameterized by δ; the state value function Qθ(s, s') is then updated:
wherein the loss term represents the loss function used when updating the value function of the current-time state and the next-time state;
in the updates of Qθ(s, s') and Vδ(s), only state pairs drawn from the offline data set D are used, which ensures that the current-time state s and the next-time state s' both lie in the offline data set D and thereby avoids interference from extrapolation errors;
A forward network fφ(s, a), parameterized by φ, is constructed to predict the next-time state from the state s and the action a, and is trained by minimization:
wherein the loss term represents the loss function used when updating the forward network;
An inverse dynamics model Iω(s, s'), parameterized by the network parameters ω, is constructed to infer the action a, and the inverse dynamics model Iω(s, s') is trained with the dense state advantage weighting function w(s, s'):
wherein Iω represents the inverse dynamics network, s and s' represent the current-time state and the next-time state, and the loss term represents the loss function used when updating the inverse dynamics model;
An additional prediction network s' = τψ(s) is constructed to estimate the next-time state s' and is trained iteratively so that its predictions stay close to the distribution of the offline data set D:
wherein the loss term represents the loss function used when updating the prediction network;
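The update process above can be summarized with the following minimal sketch. It assumes SAW-style loss forms (expectile-style regression for Vδ, a one-step TD target for Qθ, and w(s, s')-weighted regression for the inverse-dynamics and prediction networks); the network classes, optimizer layout and the parameter k shown here are illustrative assumptions rather than the claimed implementation.

```python
# Minimal sketch of one DSAW update step (assumed SAW-style losses).
import torch
import torch.nn.functional as F

def dsaw_update(batch, Q, Q_target, V, fwd, inv, pred, optimizers, k=0.8, gamma=0.99):
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Dense state advantage weight w(s, s') = 1[Q_theta'(s, s') >= V_delta(s)]
    with torch.no_grad():
        q_t = Q_target(s, s_next)
        w = (q_t >= V(s)).float()

    # 1) State evaluation function V_delta(s): expectile-style regression that
    #    up-weights states where Q_theta'(s, s') exceeds V_delta(s).
    diff = q_t - V(s)
    loss_v = (torch.abs(k - (diff < 0).float()) * diff ** 2).mean()

    # 2) State value function Q_theta(s, s'): assumed one-step TD regression.
    with torch.no_grad():
        target_q = r + gamma * V(s_next)
    loss_q = F.mse_loss(Q(s, s_next), target_q)

    # 3) Forward network f_phi(s, a): predict the next-time state.
    loss_f = F.mse_loss(fwd(s, a), s_next)

    # 4) Inverse dynamics I_omega(s, s'): recover the action, weighted by w.
    loss_i = (w * ((inv(s, s_next) - a) ** 2).sum(-1, keepdim=True)).mean()

    # 5) Prediction network tau_psi(s): keep s' close to the data-set distribution.
    loss_p = (w * ((pred(s) - s_next) ** 2).sum(-1, keepdim=True)).mean()

    for name, loss in [("v", loss_v), ("q", loss_q), ("f", loss_f), ("i", loss_i), ("p", loss_p)]:
        optimizers[name].zero_grad()
        loss.backward()
        optimizers[name].step()
```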
Step 5, enclosing the redundant driving mechanical arm joints and the obstacles by the surrounding method, simplifying each mechanical arm link into a cylinder and each obstacle into a sphere, calculating the distance between the mechanical arm joints and the obstacles, and performing obstacle avoidance detection and anti-collision testing for the redundant driving mechanical arm joints;
Step 6, establishing the model predictive control algorithm MPC state equation of the redundant driving mechanical arm, performing inverse-solution angle calculation on the path nodes n0, n1, ..., nn of the tail-end movement of the redundant driving mechanical arm to obtain the redundant driving joint angle values, and then moving the redundant driving mechanical arm according to the redundant driving joint angle values to complete the whole path planning.
Further, the specific procedure for simplifying the obstacle using the surrounding method is as follows:
regarding the redundant driving mechanical arm node as a small ball with radius r; for a spherical obstacle, on the premise that mutual collision is not allowed, the collision distance between the two is calculated as follows:
wherein the sphere center coordinates of the spherical obstacle appear in the formula; the collision detection formula is as follows:
wherein r and robji respectively represent the radius of the ball enclosing the redundant driving mechanical arm and the radius of the spherical obstacle; if tcollision is 1, the redundant driving mechanical arm collides with the ith obstacle, and if tcollision is 0, the redundant driving mechanical arm does not collide with the ith obstacle;
for a cuboid obstacle, the judgment formula for detecting whether a collision occurs between the mechanical arm and the cuboid obstacle is as follows:
An irregular obstacle is simplified by the AABB surrounding method, which ensures that each face of the simplified cuboid is parallel to a coordinate axis; let the minimum coordinate of the cuboid be (xmin, ymin, zmin) and the maximum coordinate be (xmax, ymax, zmax), let the sphere center of the ball that simplifies the redundant driving mechanical arm be (xcurrent, ycurrent, zcurrent) with radius r, and let the nearest-point coordinate be (x, y, z); the nearest point between the sphere center of the redundant driving mechanical arm and the AABB cuboid is then calculated, and if the following formula is satisfied:
the nearest point (x, y, z) coincides with the sphere center (xcurrent, ycurrent, zcurrent) of the redundant driving mechanical arm and a collision occurs; when the sphere center is not inside the cuboid, the relation between the sphere center (xcurrent, ycurrent, zcurrent) and the cuboid extents (xmin, xmax), (ymin, ymax), (zmin, zmax) is judged: if xcurrent < xmin, the nearest-point x coordinate is xmin; if xcurrent > xmax, it is xmax; the nearest-point y and z coordinates are obtained in the same way;
When the nearest point is not inside the cuboid, the distance from the nearest point to the sphere center is calculated with the Euclidean distance formula:
if distobji < r, it is judged that the redundant driving mechanical arm collides with the obstacle.
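The sphere and cuboid checks described above can be sketched as follows; the data layout and function names are illustrative, and the logic follows the surrounding-method description (ball-versus-sphere radius test and ball-versus-AABB nearest-point test).

```python
# Sketch of the surrounding-method collision checks: the arm's current node is
# a small ball of radius r, obstacles are spheres (center + radius) or
# axis-aligned cuboids (min/max corners). Names are illustrative.
import math

def collides_with_sphere(current, r, center, r_obj):
    """Ball-vs-sphere: collision when the center distance < r + r_obj."""
    return math.dist(current, center) < r + r_obj      # t_collision = 1 when True

def collides_with_aabb(current, r, box_min, box_max):
    """Ball-vs-AABB: clamp the ball center onto the box to obtain the nearest
    point, then compare the Euclidean distance with the ball radius r."""
    closest = [min(max(c, lo), hi) for c, lo, hi in zip(current, box_min, box_max)]
    if closest == list(current):                        # center lies inside the cuboid
        return True
    return math.dist(current, closest) < r

# Example: one spherical and one cuboid obstacle
node = (0.30, 0.10, 0.45)
print(collides_with_sphere(node, 0.05, (0.32, 0.12, 0.46), 0.04))
print(collides_with_aabb(node, 0.05, (0.5, 0.5, 0.0), (0.7, 0.7, 0.4)))
```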
Further, the design of the reinforcement learning reward function follows the following principle:
(1) Punishment reward function: in order to safely plan a collision-free path, a punishment term Rcollision = -λcollision is added when a collision occurs, wherein Rcollision represents the collision punishment reward and λcollision is a constant punishment parameter; if a collision occurs, the action executed by the redundant driving mechanical arm at that moment is penalized, and the punishment term helps the redundant driving mechanical arm learn how to avoid collisions;
(2) Guiding reward function: according to the distance between the redundant driving mechanical arm and the target position, the path planning task is decomposed into several stages from far from the target to near the target, and each stage uses an independent learning objective and reward mechanism, with the specific formula as follows:
wherein d represents the actual Euclidean distance between the redundant driving mechanical arm position and the target position, d1 represents a first preset Euclidean distance between the mechanical arm position and the target position, d2 represents a second preset Euclidean distance between the mechanical arm position and the target position, λ1 represents the reward parameter when the distance d is smaller than d1, λ2 represents the reward parameter when the distance d is greater than or equal to d1 and less than or equal to d2, λ3 represents the reward parameter when the distance d is greater than d2, and R1 represents the staged reward function;
(3) Basic reward function: a positive reward is provided for behavior that approaches the target state, guiding the redundant driving mechanical arm to obtain positive learning rewards even if it cannot complete the whole task during learning; the formula is as follows:
wherein dcurrent represents the Euclidean distance between the redundant driving mechanical arm and the target position at the current moment, (xcurrent, ycurrent, zcurrent) represents the current node position of the redundant driving mechanical arm in three-dimensional space, (xgoal, ygoal, zgoal) represents the target position to be reached in the path planning task, α1 is a constant adjustment parameter, and R2 represents the basic reward function; the punishment reward function, the guiding reward function and the basic reward function are added to obtain the complete path planning reward function Rsum = Rcollision + R1 + R2.
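A minimal sketch of the combined reward Rsum = Rcollision + R1 + R2 is given below; the thresholds d1, d2 and the parameters λcollision, λ1, λ2, λ3, α1 are illustrative placeholders, and the exact form of the basic reward R2 is assumed here to decrease with the distance to the goal.

```python
# Sketch of the complete path planning reward R_sum = R_collision + R1 + R2.
import math

def path_planning_reward(current, goal, collided,
                         d1=0.2, d2=0.6,
                         lam_collision=10.0, lam1=2.0, lam2=1.0, lam3=0.5,
                         alpha1=1.0):
    d = math.dist(current, goal)

    # (1) punishment reward: constant penalty when a collision occurs
    r_collision = -lam_collision if collided else 0.0

    # (2) guiding reward: staged by the distance to the target
    if d < d1:
        r1 = lam1
    elif d <= d2:
        r1 = lam2
    else:
        r1 = lam3

    # (3) basic reward: assumed to grow as the node approaches the goal
    r2 = -alpha1 * d

    return r_collision + r1 + r2
```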
Further, the specific collection steps of the offline reinforcement learning data set D are as follows: the sampling algorithms for the offline data set comprise the twin delayed deep deterministic policy gradient algorithm TD3, the deep deterministic policy gradient algorithm DDPG, and an expert data set;
The TD3 algorithm is suitable for continuous action spaces, and its improved policy update mechanism and value estimation method make it perform well in complex environments; the DDPG algorithm shows strong robustness and data efficiency during data collection and is suitable for application scenarios in which data quality and quantity are limited. The TD3 and DDPG algorithms are applied to the reinforcement learning path planning environment constructed in step 2 for training iterations, and the data pairs (s, a, r, s') generated during training are collected, wherein s represents the current-time state, a represents the action, r represents the reward, and s' represents the next-time state; the expert data set serves as an example of correct behavior to accelerate the learning process;
The data (s, a, r, s') collected in the path planning environment are preprocessed to improve their usability, and the current-time state s is normalized with the following formula:
Where λ represents the state data s after normalization processing, α2 represents the adjustment parameter, Q (si,si ') represents the state value function, s ' represents the next time state, and N represents the number of Q (si,si ').
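The collection of the offline data set can be sketched as follows; the environment interface and the behavior policies (TD3, DDPG or expert) are placeholders, and only the (s, a, r, s') record structure is taken from the description above.

```python
# Sketch of assembling the offline data set D from behavior-policy roll-outs.
import numpy as np

def collect_transitions(env, policy, episodes=100, max_steps=200):
    data = {"s": [], "a": [], "r": [], "s_next": []}
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = policy(s)                        # TD3, DDPG or expert action
            s_next, r, done = env.step(a)
            for key, value in zip(data, (s, a, r, s_next)):
                data[key].append(value)
            s = s_next
            if done:
                break
    return {k: np.asarray(v) for k, v in data.items()}

# dataset = collect_transitions(path_planning_env, td3_policy)
```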
Further, the process of simplifying the redundant driving mechanical arm joints and the obstacles into cylinders and spheres by the surrounding method is as follows: the bounding box method converts the collision problem between the mechanical arm and the obstacle into a positional relationship problem between a cylinder and a sphere; before collision detection, the positional relationship between the sphere and the cylinder must be judged, namely the relationship between the sphere center Pobj of the spherical obstacle and the drop foot D on the cylinder axis, and the position of the drop foot D relative to the cylinder axis segment is determined by the vector dot product through the following steps:
(1) calculating the vector from one end of the cylinder axis to the sphere center Pobj and the vector along the cylinder axis;
(2) calculating the normalized projection length z of the sphere-center vector onto the axis vector:
(3) judging the position of the drop foot D from the projection length z: if 0 ≤ z ≤ 1, the drop foot D lies on the cylinder axis segment; otherwise, the drop foot D lies on the extension line of the axis;
if the drop foot is on the cylinder axis, the collision distance is calculated as follows:
if the drop foot is on the extension of the cylinder axis, the collision distance is calculated as follows:
wherein dobj represents the distance between the redundant driving mechanical arm joint and the spherical obstacle, min takes the smaller value of the two formulas, and Pobj represents the sphere center coordinates of the spherical obstacle; the collision detection formula is as follows:
Wherein dobj represents the distance between the joint of the redundant driving mechanical arm and the spherical obstacle, robj represents the radius of the spherical obstacle, ri represents the minimum cylindrical radius of the joint of the redundant driving mechanical arm, if tcollision is 1, the ith joint of the redundant driving mechanical arm collides with the obstacle, and if tcollision is 0, the ith joint of the redundant driving mechanical arm does not collide with the obstacle.
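A sketch of the cylinder-versus-sphere check described above is given below; variable names are illustrative, and the logic follows the projection and drop-foot test (perpendicular distance when the foot lies on the axis segment, the smaller end-point distance otherwise).

```python
# Sketch of the link-cylinder vs. sphere-obstacle check: the link is a cylinder
# with axis segment AB and radius r_i, the obstacle is a sphere with center
# p_obj and radius r_obj.
import numpy as np

def link_sphere_collision(a, b, r_i, p_obj, r_obj):
    a, b, p = np.asarray(a, float), np.asarray(b, float), np.asarray(p_obj, float)
    axis = b - a
    z = np.dot(p - a, axis) / np.dot(axis, axis)    # projection length on the axis

    if 0.0 <= z <= 1.0:
        # drop foot D lies on segment AB: use the perpendicular distance
        d_obj = np.linalg.norm(p - (a + z * axis))
    else:
        # drop foot lies on the extension line: take the smaller end-point distance
        d_obj = min(np.linalg.norm(p - a), np.linalg.norm(p - b))

    return d_obj < r_i + r_obj                      # t_collision = 1 when True
```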
Further, the process of performing inverse-solution calculation on the path nodes with the redundant driving mechanical arm model predictive control algorithm MPC state equation is as follows:
First, the redundant drive robot model predictive control algorithm MPC is as follows:
wherein (xk+1, yk+1, zk+1) represents the position of the redundant driving mechanical arm at time k+1, (xk, yk, zk) represents the position of the redundant driving mechanical arm at time k, Δθ represents the angle increment of each joint, and Jx, Jy, Jz represent the Jacobian linear-velocity matrices in the x, y and z directions respectively, so that the state equation of the model predictive control algorithm of the redundant driving mechanical arm is:
x(k+1)=Ax(k)+Bu(k)=I[x y z]T+J[Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]T
wherein A represents the state transition matrix, B represents the control input matrix, I represents a 3×3 identity matrix, [x y z]T represents the mechanical arm end position, Δθi represents the increment of joint angle i, x(k) represents the state variable at time k, x(k+1) represents the state variable at time k+1, and u(k) represents the control input variable at time k;
The model predictive control algorithm MPC performs optimal control by predicting over a finite future time horizon, expressed in discrete state space as:
Xk+1=AXk+BUk
Let u(k+i|k) denote the input value at time k+i predicted at time k; over the prediction horizon N, the input control matrix is as follows:
similarly, x(k+i|k) represents the system state at time k+i predicted at time k, as follows:
Since the inverse solution of the redundant driving mechanical arm is being solved, the position error is E = xk - xc, wherein xc represents the target position, and the cost equation is accordingly:
combining the formulas for Xk and Uk, the formula for Xk+1 can be written in stacked form as:
and the cost matrix at time k can be expressed as:
Finally, the formula for Xk+1 is substituted into the cost matrix Jk+1 to obtain the cost matrix to be solved:
In the process, auxiliary augmented matrices are introduced so that the relation between the loss function Jk and the initial condition xk, and the relation between Jk and the input Uk, can be expressed compactly; finally, the formula for solving the inverse solution of the redundant mechanical arm with the MPC algorithm is obtained;
In order to prevent collision between the mechanical arm joint and the obstacle when the MPC algorithm performs inverse solution calculation, the obstacle avoidance detection algorithm needs to be added into the MPC equation as a constraint, and the specific steps of the obstacle avoidance constraint term are as follows:
the position of each current node can be obtained through DH matrix:
wherein the DH matrix is used to calculate the joint positions of the redundant driving mechanical arm, θj represents the current joint angle value of the mechanical arm, j represents the joint number, the angle increment, i.e. the input value of the MPC formula, is u(k+i) = [Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]T, and oj = (xj, yj, zj) represents the position of joint j of the mechanical arm in three-dimensional space; collision is prevented by constraining the distance dobj(oj) between each link of the mechanical arm and the obstacle to be larger than the collision distance:
Wherein dobj represents a distance calculation formula, and if the obtained distance is larger than ri+robj, no collision is judged, and if the obtained distance is smaller than ri+robj, the collision is judged.
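As an illustration of how the obstacle-avoidance constraint enters the inverse-solution step, the following simplified single-step sketch chooses a joint increment that drives the end effector toward the target subject to an increment bound and a clearance constraint; it is a stand-in for, not the full receding-horizon formulation of, the MPC equations above, and the Jacobian, forward-kinematics and clearance functions are assumed to be provided elsewhere.

```python
# Simplified single-step sketch of the MPC-style inverse solution with an
# obstacle-avoidance constraint. jacobian(), fk_joint_positions() and
# clearance() are assumed helpers, not part of the claimed method.
import numpy as np
from scipy.optimize import minimize

def mpc_step(theta, x_target, jacobian, fk_joint_positions, clearance, du_max=0.05):
    theta = np.asarray(theta, float)
    x_k = fk_joint_positions(theta)[-1]          # current end-effector position
    J = jacobian(theta)                          # 3 x 7 linear-velocity Jacobian

    def cost(u):                                 # position error ||x_k + J u - x_c||^2
        e = x_k + J @ u - x_target
        return float(e @ e)

    def obstacle_margin(u):                      # must stay >= 0: d_obj - (r_i + r_obj)
        return min(clearance(p) for p in fk_joint_positions(theta + u))

    res = minimize(cost, np.zeros(7),
                   bounds=[(-du_max, du_max)] * 7,               # angle-increment bound
                   constraints=[{"type": "ineq", "fun": obstacle_margin}])
    return theta + res.x                         # joint angles for the next path node
```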
The invention has the following beneficial effects: (1) the invention discloses a method for performing the path planning task of a redundant driving mechanical arm in three-dimensional space while avoiding collisions; on the basis of the SAW algorithm, a dense state advantage weighting function is proposed in combination with the path planning environment, which biases the strategy towards high-quality data in the data set, reduces the negative influence of low-quality data, and improves path planning efficiency.
(2) Aiming at the three design problems of the state space, the action space and the reward function of the reinforcement learning path planning environment, a basic reward function, a collision punishment reward function and a target reward function are designed from several aspects such as position, collision constraints and step-length limitation; they guide the redundant driving mechanical arm to learn a behavior strategy for reaching the target point and improve the success rate of planning a path in a three-dimensional space with obstacles.
(3) The redundant driving joint angles of the redundant driving mechanical arm and the joint movement path nodes are solved through the model predictive control algorithm, so that the optimal joint angles of the redundant driving mechanical arm are obtained while avoiding the obstacles.
Drawings
FIG. 1 is a schematic flow chart of a redundant drive mechanical arm path planning method based on the combination of a DSAW algorithm and an MPC algorithm;
FIG. 2 is a schematic diagram of a redundant drive mechanical arm link and an obstacle simplified to a cylinder and a sphere by the surrounding method in an embodiment of the present invention;
FIG. 3 is a schematic view of a mechanical arm link and an obstacle simplified to a cylinder and a sphere with the drop foot on the axis in an embodiment of the present invention;
FIG. 4 is a schematic view of a mechanical arm link and an obstacle simplified to a cylinder and a sphere with the drop foot on the axis extension line in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a model for constructing redundant drive mechanical arms and obstacles in CoppeliaSim's environment in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a path planning result in three-dimensional coordinates according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a path planning result in an obstacle environment according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the angle of the corresponding path node of the redundant driving mechanical arm joint calculated by the MPC algorithm in the embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating a motion prediction in CoppeliaSim environments according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be further described with reference to the accompanying drawings:
As shown in fig. 1, the invention discloses a redundant driving mechanical arm path planning method based on a DSAW offline reinforcement learning algorithm, which comprises the following specific steps:
Step 1, performing early-stage preparation for reinforcement learning path planning of the redundant driving mechanical arm and initializing a global map, namely performing environment modeling of the moving scene of the redundant driving mechanical arm in a computer and establishing a three-dimensional coordinate system; all obstacles in the moving scene are simplified into cuboids and spheres by the surrounding method so as to facilitate subsequent obstacle collision detection and collision prevention, wherein the number of obstacles is k and the set of obstacles is {obji}; an obstacle simplified to a sphere is described by its sphere radius and sphere center coordinates, and an obstacle simplified to a cuboid is described by its minimum and maximum coordinates on each direction axis;
The specific process for simplifying the obstacles by the surrounding method is as follows: the redundant driving mechanical arm node is regarded as a small ball with radius r, and for a spherical obstacle, on the premise that mutual collision is not allowed, the collision distance between the two is calculated as follows:
wherein the sphere center coordinates of the spherical obstacle appear in the formula; the collision detection formula is as follows:
wherein r and robji respectively represent the radius of the ball enclosing the redundant driving mechanical arm and the radius of the spherical obstacle; if tcollision is 1, the redundant driving mechanical arm collides with the ith obstacle, and if tcollision is 0, the redundant driving mechanical arm does not collide with the ith obstacle;
for a cuboid obstacle, the judgment formula for detecting whether a collision occurs between the mechanical arm and the cuboid obstacle is as follows:
An irregular obstacle is simplified by the AABB surrounding method, which ensures that each face of the simplified cuboid is parallel to a coordinate axis; let the minimum coordinate of the cuboid be (xmin, ymin, zmin) and the maximum coordinate be (xmax, ymax, zmax), let the sphere center of the ball that simplifies the redundant driving mechanical arm be (xcurrent, ycurrent, zcurrent) with radius r, and let the nearest-point coordinate be (x, y, z); the nearest point between the sphere center of the redundant driving mechanical arm and the AABB cuboid is then calculated, and if the following formula is satisfied:
the nearest point (x, y, z) coincides with the sphere center (xcurrent, ycurrent, zcurrent) of the redundant driving mechanical arm and a collision occurs; when the sphere center is not inside the cuboid, the relation between the sphere center (xcurrent, ycurrent, zcurrent) and the cuboid extents (xmin, xmax), (ymin, ymax), (zmin, zmax) is judged: if xcurrent < xmin, the nearest-point x coordinate is xmin; if xcurrent > xmax, it is xmax; the nearest-point y and z coordinates are obtained in the same way;
When the nearest point is not inside the cuboid, the distance from the nearest point to the sphere center is calculated with the Euclidean distance formula:
If distobji < r, it is judged that the redundant driving mechanical arm collides with the obstacle.
Step 2, constructing a reinforcement learning path planning environment for the redundant driving mechanical arm in the CoppeliaSim simulator, wherein the path planning environment comprises a reinforcement learning state space, a reinforcement learning action space and a reinforcement learning reward function to guide the redundant driving mechanical arm to learn a behavior strategy for reaching the target point, as shown in fig. 5, 6, 7 and 9;
Step 2.1, designing the reinforcement learning state space, namely designing the state space as state = {start, goal, current, distobstacle, distgoal, step} according to the actual requirements of path planning and the global map information, wherein start = (xinit, yinit, zinit) represents the starting position of the redundant driving mechanical arm in the path planning environment, goal = (xend, yend, zend) represents the end position, current = (xcurrent, ycurrent, zcurrent) represents the current node position, distobstacle represents the distance between the current node and the obstacles, distgoal represents the distance between the current node and the end point, and step represents the number of steps taken;
Step 2.2, designing the reinforcement learning action space, which defines all possible actions of the redundant driving mechanical arm; the most critical quantities in finding the next point are the direction and the step size, so the action space is designed as action = {x, y, z, length}, wherein (x, y, z) represents a direction vector whose components range over (-1, 1), and length represents the step size of a single movement of the redundant driving mechanical arm;
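The state and action definitions of steps 2.1 and 2.2 can be illustrated with the following sketch; names and bounds beyond those stated above (for example the maximum step length) are assumptions.

```python
# Illustrative construction of the state and action variables of steps 2.1/2.2.
import numpy as np

def make_state(start, goal, current, dist_obstacle, dist_goal, step):
    return {
        "start": np.asarray(start, float),        # (x_init, y_init, z_init)
        "goal": np.asarray(goal, float),          # (x_end, y_end, z_end)
        "current": np.asarray(current, float),    # (x_current, y_current, z_current)
        "dist_obstacle": float(dist_obstacle),
        "dist_goal": float(dist_goal),
        "step": int(step),
    }

def apply_action(current, action, max_length=0.05):
    """Move the end-effector node along the commanded direction."""
    direction = np.clip(action[:3], -1.0, 1.0)    # (x, y, z) components in (-1, 1)
    direction /= (np.linalg.norm(direction) + 1e-8)
    length = np.clip(action[3], 0.0, max_length)  # step size of a single movement
    return np.asarray(current, float) + length * direction
```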
Step 2.3, designing the reinforcement learning reward function, which comprises rewarding correct behaviors of the redundant driving mechanical arm and punishing undesirable behaviors, providing a mechanism that guides the redundant driving mechanical arm to avoid obstacles and gradually learn and optimize path planning in three-dimensional space;
The design of the reinforcement learning reward function follows the following principles: (1) Punishment reward function: in order to safely plan a collision-free path, a punishment term Rcollision = -λcollision is added when a collision occurs, wherein Rcollision represents the collision punishment reward and λcollision is a constant punishment parameter; if a collision occurs, the action executed by the redundant driving mechanical arm at that moment is penalized, and the punishment term helps the redundant driving mechanical arm learn how to avoid collisions.
(2) Guiding reward function: according to the distance between the redundant driving mechanical arm and the target position, the path planning task is decomposed into several stages from far from the target to near the target, and each stage uses an independent learning objective and reward mechanism, with the specific formula as follows:
wherein d represents the Euclidean distance between the redundant driving mechanical arm position and the target position, d1 represents a first preset Euclidean distance between the mechanical arm position and the target position, d2 represents a second preset Euclidean distance between the mechanical arm position and the target position, λ1 represents the reward parameter when the distance d is smaller than d1, λ2 represents the reward parameter when the distance d is greater than or equal to d1 and less than or equal to d2, λ3 represents the reward parameter when the distance d is greater than d2, and R1 represents the staged reward function;
(3) Basic reward function: a positive reward is provided for behavior that approaches the target state, guiding the redundant driving mechanical arm to obtain positive learning rewards even if it cannot complete the whole task during learning; the formula is as follows:
wherein dcurrent represents the Euclidean distance between the redundant driving mechanical arm and the target position at the current moment, (xcurrent, ycurrent, zcurrent) represents the position of the redundant driving mechanical arm in three-dimensional space, (xgoal, ygoal, zgoal) represents the target position to be reached in the path planning task, α1 is a constant adjustment parameter, and R2 represents the basic reward function;
the punishment reward function, the guiding reward function and the basic reward function are added to obtain the complete path planning reward function Rsum = Rcollision + R1 + R2.
Step 3, constructing an offline reinforcement learning data set D in the reinforcement learning path planning environment; the offline reinforcement learning data set comprises historical interaction records generated during training, data generated in the simulator, and a manually constructed path planning expert database, and necessary normalization preprocessing is performed on D to improve its usability;
Step 4, performing learning iterations of the offline reinforcement learning algorithm DSAW using the normalized and preprocessed data set D, and outputting the path nodes n0, n1, ..., nn for the movement of the tail end of the redundant driving mechanical arm;
The specific updating process of the offline reinforcement learning algorithm DSAW is as follows: according to the reinforcement learning path planning environment, a dense state advantage weighting function w(s, s') is proposed, which allocates more weight to desirable strategies and less weight to worse strategies:
w(s, s') = 1[A(s, s') ≥ 0] = 1[(Qθ(s, s') − Vψ(s)) ≥ 0] = 1[Qθ(s, s') ≥ Vψ(s)]
wherein A(s, s') represents the state advantage function, Qθ(s, s') represents the value function of the current-time state and the next-time state, Vψ(s) represents the evaluation function of the current state, and 1[Qθ(s, s') ≥ Vψ(s)] indicates that w(s, s') is 1 when Qθ(s, s') is greater than or equal to Vψ(s) and 0 when Qθ(s, s') is less than Vψ(s);
According to the updating process of the dense state advantage weighting function w(s, s'), the offline reinforcement learning algorithm DSAW first updates the state evaluation function Vδ(s):
wherein 1(·) represents the indicator function, k represents an adjustment parameter, Vδ(s) represents the state evaluation function, and the loss term represents the loss function used when updating the state evaluation function; the weight distribution emphasizes the cases in which Qθ'(s, s') is greater than Vδ(s), so that when updating the state evaluation function the DSAW algorithm increases the weight of good states and reduces the weight of bad states; Qθ'(s, s') is produced by the target network θ' of the network parameters θ of the state value function over the current-time and next-time states, and the state evaluation network Vδ(s) is parameterized by δ. The state value function Qθ(s, s') is then updated:
wherein Es,s′~D denotes the expectation over state pairs (s, s') sampled from the offline data set D, and the loss term represents the loss function used when updating the value function of the current-time state and the next-time state;
in the updates of Qθ(s, s') and Vδ(s), only state pairs drawn from the offline data set D are used, which ensures that the current-time state s and the next-time state s' both lie in the offline data set D and thereby avoids interference from extrapolation errors;
A forward network fφ(s, a), parameterized by φ, is constructed to predict the next-time state from the state s and the action a, and is trained by minimization:
wherein the loss term represents the loss function used when updating the forward network;
An inverse dynamics model Iω(s, s'), parameterized by the network parameters ω, is constructed to infer the action a, and the inverse dynamics model Iω(s, s') is trained with the dense state advantage weighting function w(s, s'):
wherein Iω represents the inverse dynamics network, s and s' represent the current-time state and the next-time state, and the loss term represents the loss function used when updating the inverse dynamics model;
An additional prediction network s' = τψ(s) is constructed to estimate the next-time state s' and is trained iteratively so that its predictions stay close to the distribution of the offline data set D:
wherein the loss term represents the loss function used when updating the prediction network.
Step 5, enclosing the redundant driving mechanical arm joints and the obstacles by the surrounding method, simplifying each mechanical arm link into a cylinder and each obstacle into a sphere, calculating the distance between the mechanical arm joints and the obstacles, and performing obstacle avoidance detection and anti-collision testing for the redundant driving mechanical arm joints, as shown in fig. 3, 4 and 5;
The process of simplifying the redundant driving mechanical arm joints and the obstacles into cylinders and spheres by the bounding box method is as follows: the bounding box method converts the collision problem between the mechanical arm and the obstacle into a positional relationship problem between a cylinder and a sphere; before collision detection, the positional relationship between the sphere and the cylinder must be judged, namely the relationship between the sphere center Pobj of the spherical obstacle and the drop foot D on the cylinder axis, and the position of the drop foot D relative to the cylinder axis segment is determined by the vector dot product through the following steps:
(1) calculating the vector from one end of the cylinder axis to the sphere center Pobj and the vector along the cylinder axis;
(2) calculating the normalized projection length z of the sphere-center vector onto the axis vector:
(3) judging the position of the drop foot D from the projection length z: if 0 ≤ z ≤ 1, the drop foot D lies on the cylinder axis segment; otherwise, the drop foot D lies on the extension line of the axis;
if the drop foot is on the cylinder axis, the collision distance is calculated as follows:
if the drop foot is on the extension of the cylinder axis, the collision distance is calculated as follows:
wherein dobj represents the distance between the redundant driving mechanical arm joint and the spherical obstacle, min takes the smaller value of the two formulas, and Pobj represents the sphere center coordinates of the spherical obstacle; the collision detection formula is as follows:
Wherein dobj represents the distance between the joint of the redundant driving mechanical arm and the spherical obstacle, robj represents the radius of the spherical obstacle, ri represents the minimum cylindrical radius of the joint of the redundant driving mechanical arm, if tcollision is 1, the ith joint of the redundant driving mechanical arm collides with the obstacle, and if tcollision is 0, the ith joint of the redundant driving mechanical arm does not collide with the obstacle.
The specific collection steps of the offline reinforcement learning data set D are as follows: the sampling algorithms for the offline data set comprise the twin delayed deep deterministic policy gradient algorithm TD3, the deep deterministic policy gradient algorithm DDPG, and an expert data set;
The TD3 algorithm is suitable for continuous action spaces, and its improved policy update mechanism and value estimation method make it perform well in complex environments; the DDPG algorithm shows strong robustness and data efficiency during data collection and is suitable for application scenarios in which data quality and quantity are limited. The TD3 and DDPG algorithms are applied to the reinforcement learning path planning environment constructed in step 2 for training iterations, and the data pairs (s, a, r, s') generated during training are collected, wherein s represents the current-time state, a represents the action, r represents the reward, and s' represents the next-time state; the expert data set serves as an example of correct behavior to accelerate the learning process.
The data (s, a, r, s') collected in the path planning environment are preprocessed to improve their usability, and the current-time state s is normalized with the following formula:
Where λ represents the state data s after normalization processing, α2 represents the adjustment parameter, Q (si,si ') represents the state value function, s ' represents the next time state, and N represents the number of Q (si,si ').
Step 6, establishing the model predictive control algorithm MPC state equation of the redundant driving mechanical arm, performing inverse-solution angle calculation on the path nodes n0, n1, ..., nn of the tail-end movement of the redundant driving mechanical arm to obtain the redundant driving joint angle values, and then moving the redundant driving mechanical arm according to the redundant driving joint angle values to complete the whole path planning.
The process of performing inverse-solution calculation on the path nodes with the redundant driving mechanical arm model predictive control algorithm MPC state equation is as follows:
First, the redundant drive robot model predictive control algorithm MPC is as follows:
wherein (xk+1, yk+1, zk+1) represents the position of the redundant driving mechanical arm at time k+1, (xk, yk, zk) represents the position of the redundant driving mechanical arm at time k, Δθ represents the angle increment of each joint, and Jx, Jy, Jz represent the Jacobian linear-velocity matrices in the x, y and z directions respectively, so that the state equation of the model predictive control algorithm of the redundant driving mechanical arm is:
x(k+1)=Ax(k)+Bu(k)=I[x y z]T+J[Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]T
wherein A represents the state transition matrix, B represents the control input matrix, I represents a 3×3 identity matrix, [x y z]T represents the mechanical arm end position, Δθi represents the increment of joint angle i, x(k) represents the state variable at time k, x(k+1) represents the state variable at time k+1, and u(k) represents the input variable at time k;
The model predictive control algorithm MPC performs optimal control by predicting over a finite future time horizon, expressed in discrete state space as:
Xk+1=AXk+BUk
Let u(k+i|k) denote the input value at time k+i predicted at time k; over the prediction horizon N, the input control matrix is as follows:
similarly, x(k+i|k) represents the system state at time k+i predicted at time k, as follows:
Since the inverse solution of the redundant driving mechanical arm is being solved, the position error is E = xk - xc, wherein xc represents the target position, and the cost equation is accordingly:
combining the formulas for Xk and Uk, the formula for Xk+1 can be written in stacked form as:
and the cost matrix at time k can be expressed as:
Finally, the formula for Xk+1 is substituted into the cost matrix Jk+1 to obtain the cost matrix to be solved:
In the process, auxiliary augmented matrices are introduced so that the relation between the loss function Jk and the initial condition xk, and the relation between Jk and the input Uk, can be seen clearly; finally, the formula for solving the inverse solution of the redundant mechanical arm with the MPC algorithm is obtained, as shown in fig. 8.
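For reference, the stacked prediction equations and the cost used above can be written in the conventional batch MPC form; the following reconstruction is a sketch under the assumption of the standard formulation, and the matrices M, C, Wx and Wu are illustrative names rather than symbols of the original derivation.

```latex
% Conventional stacked (batch) form of the MPC prediction and cost; M, C, W_x
% and W_u are illustrative names introduced here.
\begin{aligned}
X_{k+1} &=
\begin{bmatrix} x(k+1\mid k) \\ x(k+2\mid k) \\ \vdots \\ x(k+N\mid k) \end{bmatrix}
=
\underbrace{\begin{bmatrix} A \\ A^{2} \\ \vdots \\ A^{N} \end{bmatrix}}_{M} x_k
+
\underbrace{\begin{bmatrix}
B & 0 & \cdots & 0 \\
AB & B & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
A^{N-1}B & A^{N-2}B & \cdots & B
\end{bmatrix}}_{C} U_k, \\
J_{k+1} &= (X_{k+1} - X_c)^{\top} W_x (X_{k+1} - X_c) + U_k^{\top} W_u U_k .
\end{aligned}
```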
In order to prevent collision between the mechanical arm joint and the obstacle when the MPC algorithm performs inverse solution calculation, the obstacle avoidance detection algorithm needs to be added into the MPC equation as a constraint, and the specific steps of the obstacle avoidance constraint term are as follows:
the position of each current node can be obtained through DH matrix:
wherein the DH matrix is used to calculate the joint positions of the redundant driving mechanical arm, θj represents the current joint angle value of the mechanical arm, j represents the joint number, the angle increment, i.e. the input value of the MPC formula, is u(k+i) = [Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]T, and oj = (xj, yj, zj) represents the position of joint j of the mechanical arm in three-dimensional space; collision is prevented by constraining the distance dobj(oj) between each link of the mechanical arm and the obstacle to be larger than the collision distance:
Wherein dobj represents a distance calculation formula, and if the obtained distance is larger than ri+robj, no collision is judged, and if the obtained distance is smaller than ri+robj, the collision is judged.
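A sketch of this constraint check is given below: each joint position oj is obtained from the cumulative DH transforms and the distance to the obstacle center is required to exceed ri + robj; the DH parameter table and the seven-joint layout are placeholders rather than the actual arm model.

```python
# Sketch of the DH-based obstacle-avoidance constraint check.
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard Denavit-Hartenberg homogeneous transform for one joint."""
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def joint_positions(thetas, dh_table):
    """Return o_j = (x_j, y_j, z_j) for every joint of the arm."""
    T = np.eye(4)
    positions = []
    for theta, (d, a, alpha) in zip(thetas, dh_table):
        T = T @ dh_transform(theta, d, a, alpha)
        positions.append(T[:3, 3].copy())
    return positions

def obstacle_constraint_satisfied(thetas, dh_table, p_obj, r_obj, r_i):
    """True when every joint keeps d_obj > r_i + r_obj (no collision)."""
    return all(np.linalg.norm(o - p_obj) > r_i + r_obj
               for o in joint_positions(thetas, dh_table))
```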
The invention discloses a redundant driving mechanical arm path planning method based on a Dense State Advantage Weighting (DSAW) offline reinforcement learning algorithm. The DSAW algorithm applies a dense state advantage weighting function to the State Advantage Weighting (SAW) offline reinforcement learning algorithm as a weighted-assignment constraint, which increases the probability that the strategy is biased towards high-quality data and reduces the negative influence of low-quality data. For the problem of collision detection in three-dimensional space, the collision relation between a path node and an obstacle is simplified into the collision relation between a sphere and a sphere or a cuboid by the surrounding method, so that collisions are avoided. For the design of the reinforcement learning path planning environment, a basic reward function, a collision punishment reward function and a target reward function are designed from several aspects such as position, collision constraints and step-length limitation, guiding the redundant driving mechanical arm to learn a behavior strategy for reaching the target point. A Model Predictive Control (MPC) algorithm is then used to perform inverse-solution calculation on the nodes of the path and solve the corresponding joint angles; considering the presence of obstacles and the fact that the redundant driving mechanical arm has infinitely many inverse solutions, distance error constraints, angle increment constraints and obstacle avoidance constraints are added to the model predictive control algorithm, ensuring that the optimal joint angles are solved while the mechanical arm does not collide with the obstacles. Compared with classical offline reinforcement learning algorithms, the method achieves notable improvements in exploration steps, number of iterations and path length, and the use of the MPC algorithm to solve the joint angles improves solving efficiency and ensures optimal solution angles.