CN118700133B - Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm - Google Patents

Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm

Info

Publication number
CN118700133B
Authority
CN
China
Prior art keywords
state
collision
redundant
current
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410807695.8A
Other languages
Chinese (zh)
Other versions
CN118700133A (en)
Inventor
陈正升
梁爽
王雪松
程玉虎
田阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
System Equipment Co ltd Of 28th Research Institute Li Yang
China University of Mining and Technology Beijing CUMTB
Original Assignee
System Equipment Co ltd Of 28th Research Institute Li Yang
China University of Mining and Technology Beijing CUMTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by System Equipment Co ltd Of 28th Research Institute Li Yang, China University of Mining and Technology Beijing CUMTB
Priority to CN202410807695.8A
Publication of CN118700133A
Application granted
Publication of CN118700133B
Legal status: Active

Abstract

The invention discloses a path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm, belonging to the field of three-dimensional robotic-arm path planning. For the collision detection problem in three-dimensional space, the collision relation between a path node and an obstacle is simplified, via a bounding method, into the collision relation between a sphere and a cuboid, so that collisions are avoided. For the design of the reinforcement learning path planning environment, a basic reward function, a collision penalty reward function, a target reward function and the like are designed from the aspects of position, collision constraints and step-size limits, guiding the redundantly driven arm to learn a behavior policy that reaches the target point. A model predictive control algorithm then performs the inverse-solution calculation for the nodes on the path to obtain the corresponding joint angles. The method is convenient to use, computationally efficient, and guarantees optimal solution of the joint angles along the path of the redundantly driven arm.

Description

Redundant driving mechanical arm path planning method based on DSAW offline reinforcement learning algorithm
Technical Field
The invention belongs to the field of path planning of a three-dimensional space of a mechanical arm, and particularly relates to a path planning method of a redundant driving mechanical arm based on a DSAW offline reinforcement learning algorithm.
Background
Path planning is a core task for a robotic arm performing autonomous operations; its goal is to find an optimal or at least feasible path from a starting point to an end point in the environment for the arm's end effector, while a redundantly driven arm admits infinitely many inverse-kinematic solutions for that path. Many factors must be considered during redundantly driven arm path planning, such as avoiding obstacles, computing inverse solutions, minimizing path length or cost, and following specific path rules. Path planning is critical to improving the efficiency and safety of robots and automation systems.
With the development of technology, conventional path planning methods, while effective in certain scenarios, struggle in complex, dynamic, or unknown environments. These environments require systems with greater adaptability and learning capability, able to make decisions based on environmental changes and historical experience. Reinforcement learning is well suited to these problems because of its high degree of autonomous learning. Combining path planning with reinforcement learning therefore offers a new perspective and a new method for solving navigation problems in complex environments.
However, in more complex and uncertain high-dimensional environments, reinforcement learning has difficulty adapting to high-dimensional multi-objective tasks because of its limited exploration ability: exploration is time-consuming, convergence is slow, and the curse of dimensionality easily degrades planning capability. Offline reinforcement learning, also called batch reinforcement learning, is a variant of reinforcement learning whose main idea is that the redundantly driven arm learns from a fixed data set rather than by exploration.
Robotic-arm path planning also differs from that of unmanned aerial vehicles and wheeled vehicles: the planning stage must yield accurate joint angles so that the arm's end effector actually reaches the designated position.
Disclosure of Invention
In view of the prior art, the invention provides a path planning method based on the DSAW (Dense State Advantage Weighting) offline reinforcement learning algorithm, which improves node traversal speed, reduces the number of iterations, shortens the output path length, and guarantees optimal joint angles on large-scale maps.
The invention discloses a redundant driving mechanical arm path planning method based on a DSAW offline reinforcement learning algorithm, which comprises the following steps of:
Step 1, carry out the preliminary preparation for reinforcement learning path planning of the redundantly driven arm and initialize the global map: model the environment of the arm's movement scene in a computer and establish a three-dimensional coordinate system, and use a bounding method to simplify all obstacles in the scene into cuboids and spheres so that subsequent obstacle collision detection is convenient and collisions are prevented. Let the number of obstacles be k and the obstacle set be {obj_i}; an obstacle simplified to a sphere is described by its radius and centre coordinates, and an obstacle simplified to a cuboid is described by the minimum and maximum coordinates on each direction axis, i.e. (x_min, y_min, z_min) and (x_max, y_max, z_max);
Step 2, constructing a reinforcement learning path planning environment related to the redundant driving mechanical arm in a simulation simulator CoppeliaSim, wherein the path planning environment comprises reinforcement learning state space, reinforcement learning action space and reinforcement learning reward functions to guide the redundant driving mechanical arm to learn a behavior strategy reaching a target point;
Step 2.1, design the reinforcement learning state space: according to the actual requirements of path planning and the global map information, the state space is designed as state = {start, goal, current, dist_obstacle, dist_goal, step}, where start = (x_init, y_init, z_init) denotes the starting position of the redundantly driven arm in the path planning environment, goal = (x_end, y_end, z_end) denotes the end position, current = (x_current, y_current, z_current) denotes the arm's current node position, dist_obstacle denotes the distance between the current node and the obstacle, dist_goal denotes the distance between the current node and the end point, and step denotes the number of steps taken;
Step 2.2, design the reinforcement learning action space and define all actions the redundantly driven arm may execute. The most critical actions concern the direction and step size used while searching for the next point, so the action space is designed as action = {x, y, z, length}, where (x, y, z) denotes a direction vector with each component in the range (-1, 1) and length denotes the step size of a single movement of the arm (a code sketch of this state and action space follows Step 2.3);
Step 2.3, design the reinforcement learning reward function, which rewards correct behaviour of the redundantly driven arm, penalizes undesirable behaviour, and provides a mechanism that guides the arm to avoid obstacles and to learn and optimize path planning in three-dimensional space step by step;
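By way of illustration only, the state and action space of Steps 2.1 and 2.2 could be assembled as in the following Python sketch; the helper names, the flattened numpy layout and the step-length bound are assumptions introduced here, not part of the disclosed method.

```python
import numpy as np

def build_state(start, goal, current, obstacle_centers, step):
    """Assemble state = {start, goal, current, dist_obstacle, dist_goal, step} as a flat vector."""
    start, goal, current = (np.asarray(v, dtype=float) for v in (start, goal, current))
    # Distance from the current node to the nearest obstacle centre (illustrative simplification).
    dist_obstacle = min(np.linalg.norm(current - np.asarray(c, dtype=float)) for c in obstacle_centers)
    dist_goal = np.linalg.norm(current - goal)
    return np.concatenate([start, goal, current, [dist_obstacle, dist_goal, float(step)]])

def apply_action(current, action, max_length=0.05):
    """Apply action = {x, y, z, length}: a direction vector clipped to (-1, 1) scaled by a step length."""
    direction = np.clip(np.asarray(action[:3], dtype=float), -1.0, 1.0)
    norm = np.linalg.norm(direction)
    if norm > 0.0:
        direction = direction / norm          # use the direction only; the magnitude comes from length
    length = float(np.clip(action[3], 0.0, max_length))  # assumed upper bound on a single step
    return np.asarray(current, dtype=float) + length * direction
```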
Step 3, construct an offline reinforcement learning data set D in the reinforcement learning path planning environment. The offline data set includes historical interaction records produced during training, data generated in the simulation simulator, and a manually designed path-planning expert database; the data set D is given the necessary normalization preprocessing to improve its usability;
Step 4, use the normalized and preprocessed data set D to run learning iterations of the offline reinforcement learning algorithm DSAW and output the path nodes n_0, n_1, ..., n_n along which the end of the redundantly driven arm moves;
The specific update process of the offline reinforcement learning algorithm DSAW is as follows:
According to the reinforcement learning path planning environment, a dense state advantage weighting function w(s, s') is proposed that assigns a larger weight to desirable policies and a smaller weight to poorer policies:
w(s, s') = 1[A(s, s') ≥ 0] = 1[(Q_θ(s, s') − V_ψ(s)) ≥ 0] = 1[Q_θ(s, s') ≥ V_ψ(s)]
where A(s, s') denotes the state advantage function, Q_θ(s, s') denotes the value function over the current-time and next-time states, V_ψ(s) denotes the evaluation function of the current state, and 1[Q_θ(s, s') ≥ V_ψ(s)] means that w(s, s') equals 1 when Q_θ(s, s') is greater than V_ψ(s) and 0 when Q_θ(s, s') is less than V_ψ(s);
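Expressed as code, this dense state advantage weighting reduces to a binary mask over a batch of state pairs; the tensor shapes below are assumptions.

```python
import torch

def state_advantage_weight(q_values: torch.Tensor, v_values: torch.Tensor) -> torch.Tensor:
    """w(s, s') = 1 where Q_theta(s, s') >= V_psi(s), otherwise 0; no gradient flows through the mask."""
    return (q_values >= v_values).float().detach()
```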
According to the update process driven by the dense state advantage weighting function w(s, s'), the offline reinforcement learning algorithm DSAW first updates the state evaluation function V_δ(s) by minimizing a weighted regression loss.
In this loss, an indicator function and an adjustment parameter k set the weight of each sample, V_δ(s) denotes the state evaluation function, and the loss is the one minimized when the state evaluation function is updated. The weight distribution function emphasizes the cases in which Q_θ'(s, s') exceeds V_δ(s), so that the DSAW algorithm raises the weight of good states and lowers the weight of bad states during the state evaluation update. Q_θ'(s, s') is produced by the target network θ' of the network parameters θ of the state value function over the current-time and next-time states, while the state evaluation network V_δ(s) is generated from the network parameters δ. The state value function Q_θ(s, s') is then updated with its own regression loss,
where this loss governs the update of the value function over the current-time and next-time states;
In the updates of Q_θ(s, s') and V_δ(s), only state pairs drawn from the offline data set D are used, which ensures that the state s and the next-time state s' involved in the update lie in the offline data set D and thus avoids interference from extrapolation errors;
A forward network f_φ(s, a), generated from the forward network parameters φ, predicts the next-time state from the state s and action a and is trained by minimizing a prediction loss,
where this loss is the one minimized when the forward network is updated;
An inverse dynamics model I_ω(s, s'), generated from the network parameters ω, is constructed to infer the action a; it is trained under the dense state advantage weighting function w(s, s'),
where I_ω denotes the inverse dynamics network, s and s' denote the current-time and next-time states, and the corresponding loss is the one minimized when the inverse dynamics model is updated;
An additional prediction network s' = τ_ψ(s) is constructed to estimate the next-time state s'; its training iterations keep it close to the distribution of the data set D,
where the corresponding loss is the one minimized when the prediction network is updated;
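The published loss expressions appear only as images and are not reproduced above, so the following PyTorch sketch reconstructs one DSAW-style update step from the textual description alone: an expectile-style regression of V_δ toward Q_θ', a regression for Q_θ, a forward model f_φ, a weighted inverse dynamics model I_ω and a prediction network τ_ψ. The discount factor, the exact loss forms and the optimizer handling are assumptions, not the patented equations; value networks are assumed to return 1-D tensors of shape [batch].

```python
import torch
import torch.nn.functional as F

def dsaw_update(batch, v_net, q_net, q_target, f_net, i_net, tau_net, optimizers, k=0.9, gamma=0.99):
    """One hedged DSAW-style update on a batch (s, a, r, s') from the offline data set D (assumed forms)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # V update: expectile-style regression of V_delta(s) toward Q_theta'(s, s'),
    # weighting samples with Q >= V by k and the rest by (1 - k).
    with torch.no_grad():
        q_t = q_target(s, s_next)
    diff = q_t - v_net(s)
    weight = torch.where(diff >= 0, torch.full_like(diff, k), torch.full_like(diff, 1.0 - k))
    loss_v = (weight * diff.pow(2)).mean()

    # Q update: regress Q_theta(s, s') toward r + gamma * V_delta(s'); only dataset states are used.
    with torch.no_grad():
        target_q = r + gamma * v_net(s_next)
    loss_q = F.mse_loss(q_net(s, s_next), target_q)

    # Forward model f_phi(s, a) predicts the next state.
    loss_f = F.mse_loss(f_net(s, a), s_next)

    # Inverse dynamics I_omega(s, s') recovers the action, weighted by w(s, s').
    with torch.no_grad():
        w = (q_net(s, s_next) >= v_net(s)).float()
    loss_i = (w * (i_net(s, s_next) - a).pow(2).sum(-1)).mean()

    # Prediction network tau_psi(s) stays close to the dataset's next-state distribution.
    loss_tau = F.mse_loss(tau_net(s), s_next)

    for name, loss in (("v", loss_v), ("q", loss_q), ("f", loss_f), ("i", loss_i), ("tau", loss_tau)):
        optimizers[name].zero_grad()
        loss.backward()
        optimizers[name].step()
```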
Step 5, setting a redundant driving mechanical arm joint and an obstacle by adopting a surrounding method, simplifying a mechanical arm connecting rod into a cylinder, simplifying the obstacle into a sphere, calculating the distance between the mechanical arm joint and the obstacle, performing obstacle avoidance detection of the redundant driving mechanical arm joint, and performing collision prevention test;
Step 6, establish the model predictive control (MPC) state equation of the redundantly driven arm, perform the inverse-solution angle calculation for the path nodes n_0, n_1, ..., n_n along which the arm's end moves to obtain the redundantly driven joint angle values, and then move the redundantly driven arm according to these joint angle values to complete the whole path planning.
Further, the specific procedure for simplifying the obstacles with the bounding method is as follows:
The redundantly driven arm is regarded as a small sphere of radius r. For a sphere obstacle, under the requirement that the two may not collide, the collision distance between them is
dist_obj_i = sqrt((x_current − x_obj_i)² + (y_current − y_obj_i)² + (z_current − z_obj_i)²),
where (x_obj_i, y_obj_i, z_obj_i) denotes the centre coordinates of the sphere obstacle. The collision detection formula is t_collision = 1 if dist_obj_i < r + r_obj_i, and t_collision = 0 otherwise,
where r and r_obj_i denote the radius of the redundantly driven arm sphere and the radius of the sphere obstacle respectively; t_collision = 1 means the arm collides with the i-th obstacle and t_collision = 0 means it does not;
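A minimal sketch of this sphere-to-sphere check, assuming the obstacle is given as a centre plus a radius:

```python
import numpy as np

def sphere_collision(current, r, obstacle_center, r_obj):
    """Return 1 if the arm sphere (radius r) intersects the sphere obstacle (radius r_obj), else 0."""
    dist = np.linalg.norm(np.asarray(current, dtype=float) - np.asarray(obstacle_center, dtype=float))
    return 1 if dist < r + r_obj else 0
```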
For a cuboid obstacle, the judgement for detecting whether the arm and the cuboid collide is as follows:
Irregular obstacles are simplified with the AABB bounding method, which guarantees that every face of the simplified cuboid is parallel to a coordinate axis. Let the minimum coordinate of the cuboid be (x_min, y_min, z_min) and the maximum coordinate be (x_max, y_max, z_max); let the centre of the sphere representing the redundantly driven arm be (x_current, y_current, z_current), its radius be r, and the closest point be (x, y, z). The closest point between the arm's sphere centre and the AABB cuboid is computed; if
x_min ≤ x_current ≤ x_max, y_min ≤ y_current ≤ y_max and z_min ≤ z_current ≤ z_max,
then the closest point (x, y, z) coincides with the arm's sphere centre (x_current, y_current, z_current) and a collision occurs. When the sphere centre is not inside the cuboid, the distance relation between the centre (x_current, y_current, z_current) and the intervals (x_min, x_max), (y_min, y_max), (z_min, z_max) is examined: if x_current < x_min the closest-point coordinate is x_min, if x_current > x_max it is x_max, and the closest coordinates y and z are obtained in the same way;
When the closest point does not lie inside the cuboid, its distance to the sphere centre is computed with the Euclidean distance formula
dist_obj_i = sqrt((x − x_current)² + (y − y_current)² + (z − z_current)²);
if dist_obj_i < r, the redundantly driven arm is judged to collide with the obstacle.
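The AABB test amounts to clamping the sphere centre onto the box and measuring the remaining distance; the clamp-based sketch below is equivalent to the per-axis comparison described above and is offered only as an illustration.

```python
import numpy as np

def aabb_collision(current, r, box_min, box_max):
    """Return 1 if the arm sphere (centre `current`, radius r) touches the AABB [box_min, box_max]."""
    current = np.asarray(current, dtype=float)
    closest = np.clip(current, np.asarray(box_min, dtype=float), np.asarray(box_max, dtype=float))
    dist = np.linalg.norm(current - closest)   # Euclidean distance to the closest point of the box
    return 1 if dist < r else 0                # a centre inside the box gives dist = 0, i.e. a collision
```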
Further, the design of the reinforcement learning reward function follows these principles:
(1) Penalty reward function: to plan a collision-free path safely, a penalty parameter is added whenever a collision occurs, R_collision = −λ_collision, where R_collision denotes the collision penalty reward and λ_collision is a constant penalty parameter. If a collision occurs, this penalty reward is assigned to the action the arm executed at that moment; the penalty term helps the redundantly driven arm learn how to avoid collisions;
(2) Guiding reward function: according to the distance between the redundantly driven arm and the target position, the path planning task is decomposed into several stages from far from the target to close to the target, and each stage uses its own learning objective and reward mechanism,
R_1 = λ_1 when d < d_1, R_1 = λ_2 when d_1 ≤ d ≤ d_2, and R_1 = λ_3 when d > d_2,
where d denotes the actual Euclidean distance between the arm position and the target position, d_1 denotes a first preset Euclidean distance between the arm position and the target position, d_2 a second preset Euclidean distance, λ_1 the reward parameter when the distance d is smaller than d_1, λ_2 the reward parameter when d_1 ≤ d ≤ d_2, λ_3 the reward parameter when d > d_2, and R_1 the staged reward function;
(3) Basic reward function: a positive reward is provided for behaviour that approaches the target state, so that the redundantly driven arm obtains positive learning feedback even if it cannot yet complete the whole task. The reward is computed from the current Euclidean distance
d_current = sqrt((x_current − x_goal)² + (y_current − y_goal)² + (z_current − z_goal)²),
where d_current denotes the Euclidean distance between the arm and the target position at the current moment, (x_current, y_current, z_current) denotes the arm's current node position in three-dimensional space, (x_goal, y_goal, z_goal) denotes the target position to be reached in the path planning task, α_1 is a constant adjustment parameter, and R_2 denotes the basic reward function. Adding the penalty reward function, the guiding reward function and the basic reward function gives the complete path planning reward function R_sum = R_collision + R_1 + R_2 (a hedged code sketch is given below).
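A hedged sketch of the combined reward R_sum = R_collision + R_1 + R_2 follows; the numeric parameters and the exact form of the basic reward R_2 are placeholders assumed for illustration, since the patent does not fix them in the text above.

```python
import numpy as np

# Placeholder reward parameters (illustrative values, not taken from the patent).
LAMBDA_COLLISION, LAMBDA_1, LAMBDA_2, LAMBDA_3 = 10.0, 5.0, 2.0, 0.5
D1, D2, ALPHA_1 = 0.1, 0.5, 1.0

def path_planning_reward(current, goal, collided):
    """R_sum = R_collision + R_1 + R_2 for one step, with assumed parameter values."""
    d = np.linalg.norm(np.asarray(current, dtype=float) - np.asarray(goal, dtype=float))
    r_collision = -LAMBDA_COLLISION if collided else 0.0   # collision penalty reward
    if d < D1:                                             # staged guiding reward R_1
        r_guide = LAMBDA_1
    elif d <= D2:
        r_guide = LAMBDA_2
    else:
        r_guide = LAMBDA_3
    r_base = -ALPHA_1 * d                                  # assumed dense basic reward, larger nearer the goal
    return r_collision + r_guide + r_base
```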
Further, the specific collection steps for the offline reinforcement learning data set D are as follows: the sampling algorithms for the offline data set include the twin delayed deep deterministic policy gradient algorithm TD3, the deep deterministic policy gradient algorithm DDPG, and an expert data set;
The TD3 algorithm is well suited to continuous action spaces, and its improved policy-update mechanism and value-estimation method make it perform well in complex environments; the DDPG algorithm shows strong robustness and data efficiency during data collection and suits application scenarios in which data quality and quantity are limited. The TD3 and DDPG algorithms are applied to the reinforcement learning path planning environment built in Step 2 for training iterations, and the data pairs (s, a, r, s') generated during training are collected, where s denotes the current-time state, a the action, r the reward and s' the next-time state; the expert data set serves as a set of correct behaviour examples that accelerates the learning process;
The data (s, a, r, s') collected in the path planning environment are preprocessed to improve their usability, and the current-time state s is normalized,
where λ denotes the normalized state data, α_2 an adjustment parameter, Q(s_i, s_i') the state value function, s' the next-time state, and N the number of Q(s_i, s_i') terms.
Further, the process of simplifying the redundantly driven arm joints and the obstacles into cylinders and spheres by the bounding method is as follows: the bounding-box method converts the collision problem between the arm and an obstacle into a positional-relation problem between a cylinder and a sphere. Before collision detection, the positional relation between the sphere and the cylinder must be judged, namely the relation between the sphere centre P_obj of the sphere obstacle and the foot of perpendicular D on the cylinder axis; the projection of the vector from one end of the cylinder axis to P_obj onto the axis, computed with the dot product, determines the position of the foot D. The steps are:
(1) Compute the vector from one end of the cylinder axis to the sphere centre P_obj and the cylinder-axis vector;
(2) Compute the projection length z of the first vector onto the axis vector, normalized by the axis length;
(3) Judge the position of the foot of perpendicular D from the projection length z: if 0 ≤ z ≤ 1, the foot D lies on the axis segment; otherwise, the foot D lies on the extension of the axis;
If the foot of perpendicular lies on the cylinder axis, the collision distance is the perpendicular distance from the sphere centre to the axis;
if the foot of perpendicular lies on the extension of the cylinder axis, the collision distance is the smaller of the distances from the sphere centre to the two ends of the axis,
where d_obj denotes the distance between the redundantly driven arm joint (link) and the sphere obstacle, min takes the smaller of the two candidate distances, and P_obj denotes the centre coordinates of the sphere obstacle. The collision detection formula is t_collision = 1 if d_obj < r_i + r_obj and t_collision = 0 otherwise,
where d_obj denotes the distance between the arm joint and the sphere obstacle, r_obj the radius of the sphere obstacle, and r_i the smallest cylinder radius of the arm joint; t_collision = 1 means the i-th joint of the redundantly driven arm collides with the obstacle, and t_collision = 0 means it does not.
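A compact sketch of the link-versus-sphere test described above; the axis endpoints A and B are names introduced here for the two ends of the cylinder axis and are not notation from the patent.

```python
import numpy as np

def link_obstacle_collision(A, B, p_obj, r_i, r_obj):
    """Check a link (cylinder with axis A->B, radius r_i) against a sphere obstacle (centre p_obj, radius r_obj)."""
    A, B, p_obj = (np.asarray(v, dtype=float) for v in (A, B, p_obj))
    axis = B - A
    z = np.dot(p_obj - A, axis) / np.dot(axis, axis)   # normalized projection length onto the axis
    if 0.0 <= z <= 1.0:                                # foot of perpendicular D lies on the axis segment
        d_obj = np.linalg.norm(p_obj - (A + z * axis))
    else:                                              # D lies on the extension: take the nearer endpoint
        d_obj = min(np.linalg.norm(p_obj - A), np.linalg.norm(p_obj - B))
    return 1 if d_obj < r_i + r_obj else 0
```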
Further, the process by which the state equation of the redundantly driven arm's model predictive control algorithm MPC performs the inverse-solution calculation for the path nodes is as follows:
First, the MPC model of the redundantly driven arm is as follows:
where (x_{k+1}, y_{k+1}, z_{k+1}) denotes the position of the redundantly driven arm at time k+1, (x_k, y_k, z_k) denotes its position at time k, Δθ denotes the angular velocity of each joint, and J_x, J_y, J_z denote the rows of the linear-velocity Jacobian in the x, y and z directions respectively. The state equation of the arm's model predictive control algorithm is therefore
x(k+1) = A x(k) + B u(k) = I [x y z]^T + J [Δθ_1 Δθ_2 Δθ_3 Δθ_4 Δθ_5 Δθ_6 Δθ_7]^T
where A denotes the state transition matrix, B the control input matrix, I a 3×3 identity matrix, [x y z]^T the arm's end position, Δθ_i the increment of joint angle i, x(k) the state variable at time k, x(k+1) the state variable at time k+1, and u(k) the control input at time k;
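One prediction step of this state equation can be written directly from the Jacobian; a small sketch follows, where the 3x7 Jacobian is assumed to be supplied by the kinematics of the specific arm.

```python
import numpy as np

def predict_end_effector(x_k, jacobian_v, delta_theta):
    """x(k+1) = I * x(k) + J * dtheta for a 7-joint redundantly driven arm.

    x_k:          current end position [x, y, z]
    jacobian_v:   3x7 linear-velocity Jacobian, rows [J_x; J_y; J_z]
    delta_theta:  7 joint-angle increments for this step
    """
    return np.asarray(x_k, dtype=float) + np.asarray(jacobian_v, dtype=float) @ np.asarray(delta_theta, dtype=float)
```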
The MPC algorithm performs optimal control by predicting over a future time horizon, which in discrete state space is expressed as
X_{k+1} = A X_k + B U_k;
Let u(k+i|k) denote the input at the (k+i)-th time predicted at time k; over the prediction horizon N the stacked input control matrix U_k collects these inputs;
similarly, x(k+i|k) denotes the system state at the (k+i)-th time predicted at time k, and the stacked state matrix X_k collects these states;
Because the inverse solution of the redundantly driven arm is being solved, the position error is E = x_k − x_c, where x_c denotes the target position, and the cost equation is built from this error;
Combining with X_k and U_k, the expression X_{k+1} is rewritten in stacked form,
and the cost matrix at time k can then be expressed in terms of the stacked quantities;
Finally, substituting X_{k+1} into the cost matrix J_{k+1} gives the cost matrix to be solved;
In this process, auxiliary augmented matrices are introduced so that the relation between the loss function J_k and the initial condition x_k, and between J_k and the input U_k, is expressed explicitly; the formula for solving the inverse solution of the redundant arm with the MPC algorithm is finally obtained;
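The stacked prediction and cost matrices in the published equations are not reproduced above, so the sketch below uses the standard condensed MPC formulation X = M x_k + C U with a quadratic position-error cost; it matches the textual description (the cost J_k expressed through the initial condition x_k and the input U_k) but should be read as an assumed reconstruction, not the patented derivation.

```python
import numpy as np

def solve_mpc_inputs(x_k, x_c, A, B, N, q_weight=1.0, r_weight=1e-3):
    """Condensed-MPC least-squares solve over horizon N (assumed reconstruction).

    Builds the stacked prediction X = M x_k + C U, minimises ||X - X_ref||^2_Q + ||U||^2_R
    for a constant reference X_ref built from the target position x_c, and returns the
    stacked joint-angle increments U = [u(k|k), ..., u(k+N-1|k)].
    """
    x_k, x_c = np.asarray(x_k, dtype=float), np.asarray(x_c, dtype=float)
    n, m = A.shape[0], B.shape[1]
    M = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(N)])         # effect of the initial state
    C = np.zeros((N * n, N * m))
    for i in range(N):
        for j in range(i + 1):
            C[i * n:(i + 1) * n, j * m:(j + 1) * m] = np.linalg.matrix_power(A, i - j) @ B
    X_ref = np.tile(x_c, N)                                                     # track the target position
    Q = q_weight * np.eye(N * n)
    R = r_weight * np.eye(N * m)
    # Minimiser of the quadratic cost: (C^T Q C + R) U = C^T Q (X_ref - M x_k)
    U = np.linalg.solve(C.T @ Q @ C + R, C.T @ Q @ (X_ref - M @ x_k))
    return U.reshape(N, m)
```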
To prevent collisions between the arm joints and obstacles while the MPC algorithm computes the inverse solution, the obstacle-avoidance detection algorithm must be added to the MPC equation as a constraint. The obstacle-avoidance constraint term is constructed as follows:
The position of each current node is obtained from the DH matrix;
in the formula, the DH matrix is used to compute the joint positions of the redundantly driven arm, θ_j denotes the current joint angle value of the arm, and j denotes the joint index; the angle increment, i.e. the input of the MPC formula, is u(k+i) = [Δθ_1 Δθ_2 Δθ_3 Δθ_4 Δθ_5 Δθ_6 Δθ_7]^T, and o_j = (x_j, y_j, z_j) denotes the corresponding position of the arm in three-dimensional space. Collision is prevented by constraining the distance d_obj(o_j) between each link of the arm and the obstacle to be larger than the collision distance:
where d_obj denotes the distance calculation formula; if the obtained distance is larger than r_i + r_obj, no collision is judged, and if it is smaller than r_i + r_obj, a collision is judged.
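As an illustration of how this constraint can be evaluated, the sketch below chains standard DH transforms to obtain the joint positions o_j and checks the distance condition d_obj > r_i + r_obj; the DH parameter layout and the obstacle format are assumptions, not values from the patent.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Standard DH homogeneous transform for one joint (parameter convention assumed)."""
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def joint_positions(thetas, dh_params):
    """Chain the DH transforms to obtain every joint position o_j = (x_j, y_j, z_j)."""
    T = np.eye(4)
    positions = []
    for theta, (d, a, alpha) in zip(thetas, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
        positions.append(T[:3, 3].copy())
    return positions

def obstacle_constraint_satisfied(thetas, dh_params, obstacles, r_i):
    """MPC obstacle-avoidance constraint: every joint position keeps d_obj > r_i + r_obj."""
    for o_j in joint_positions(thetas, dh_params):
        for center, r_obj in obstacles:
            if np.linalg.norm(o_j - np.asarray(center, dtype=float)) <= r_i + r_obj:
                return False
    return True
```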
The invention has the following beneficial effects. (1) The invention discloses a method for carrying out the path planning task of a redundantly driven arm in three-dimensional space while avoiding collisions. On the basis of the SAW algorithm, and combined with the path planning environment, a dense state advantage weighting function is proposed so that the policy is biased toward the high-quality data in the data set, the negative influence of low-quality data is reduced, and path planning efficiency is improved.
(2) For the three design problems of the reinforcement learning path planning environment, namely the state space, the action space and the reward function, a basic reward function, a collision penalty reward function and a target reward function are designed from the aspects of position, collision constraints and step-size limits; the target reward function guides the redundantly driven arm to learn a behaviour policy that reaches the target point and improves the success rate of planning paths in a three-dimensional space containing obstacles.
(3) Solving the redundant driving joint angles of the redundant driving mechanical arm and solving the joint movement path nodes of the redundant driving mechanical arm through a model predictive control algorithm, so that the optimal joint angles of the redundant driving mechanical arm are solved under the condition of avoiding the obstacle.
Drawings
FIG. 1 is a schematic flow chart of a redundant drive mechanical arm path planning method based on the combination of a DSAW algorithm and an MPC algorithm;
FIG. 2 is a schematic diagram of a redundantly driven arm link and an obstacle simplified into a cylinder and a sphere by the bounding method in an embodiment of the present invention;
FIG. 3 is a schematic view of an arm link and an obstacle simplified into a cylinder and a sphere with the foot of perpendicular on the axis in an embodiment of the present invention;
FIG. 4 is a schematic view of an arm link and an obstacle simplified into a cylinder and a sphere with the foot of perpendicular on the axis extension line in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a model for constructing redundant drive mechanical arms and obstacles in CoppeliaSim's environment in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a path planning result in three-dimensional coordinates according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a path planning result in an obstacle environment according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the angle of the corresponding path node of the redundant driving mechanical arm joint calculated by the MPC algorithm in the embodiment of the present invention;
FIG. 9 is a schematic diagram illustrating a motion prediction in CoppeliaSim environments according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be further described with reference to the accompanying drawings:
As shown in fig. 1, the invention discloses a redundant driving mechanical arm path planning method based on a DSAW offline reinforcement learning algorithm, which comprises the following specific steps:
Step 1, carry out the preliminary preparation for reinforcement learning path planning of the redundantly driven arm and initialize the global map: model the environment of the arm's movement scene in a computer and establish a three-dimensional coordinate system, and use a bounding method to simplify all obstacles in the scene into cuboids and spheres so that subsequent obstacle collision detection is convenient and collisions are prevented. Let the number of obstacles be k and the obstacle set be {obj_i}; an obstacle simplified to a sphere is described by its radius and centre coordinates, and an obstacle simplified to a cuboid is described by the minimum and maximum coordinates on each direction axis, i.e. (x_min, y_min, z_min) and (x_max, y_max, z_max);
The specific process of simplifying the obstacles with the bounding method is as follows: the redundantly driven arm is regarded as a small sphere of radius r. For a sphere obstacle, under the requirement that the two may not collide, the collision distance between them is the Euclidean distance from the arm's sphere centre to the obstacle centre,
where (x_obj_i, y_obj_i, z_obj_i) denotes the centre coordinates of the sphere obstacle, and the collision detection formula is t_collision = 1 if dist_obj_i < r + r_obj_i and t_collision = 0 otherwise,
where r and r_obj_i denote the radius of the redundantly driven arm sphere and the radius of the sphere obstacle respectively; t_collision = 1 means the arm collides with the i-th obstacle and t_collision = 0 means it does not;
For a cuboid obstacle, the judgement for detecting whether the arm and the cuboid collide is as follows:
Irregular obstacles are simplified with the AABB bounding method, which guarantees that every face of the simplified cuboid is parallel to a coordinate axis. Let the minimum coordinate of the cuboid be (x_min, y_min, z_min) and the maximum coordinate be (x_max, y_max, z_max); let the centre of the sphere representing the arm be (x_current, y_current, z_current), its radius be r, and the closest point be (x, y, z). The closest point between the arm's sphere centre and the AABB cuboid is computed; if x_min ≤ x_current ≤ x_max, y_min ≤ y_current ≤ y_max and z_min ≤ z_current ≤ z_max,
then the closest point (x, y, z) coincides with the sphere centre (x_current, y_current, z_current) and a collision occurs. When the sphere centre is not inside the cuboid, the distance relation between the centre and the intervals (x_min, x_max), (y_min, y_max), (z_min, z_max) is examined: if x_current < x_min the closest-point coordinate is x_min, if x_current > x_max it is x_max, and the closest coordinates y and z are obtained in the same way;
When the closest point does not lie inside the cuboid, its distance dist_obj_i to the sphere centre is computed with the Euclidean distance formula;
if dist_obj_i < r, the redundantly driven arm is judged to collide with the obstacle.
Step 2, constructing a reinforcement learning path planning environment related to the redundant driving mechanical arm in a simulation simulator CoppeliaSim, wherein the path planning environment comprises reinforcement learning state space, reinforcement learning action space and reinforcement learning reward functions to guide the redundant driving mechanical arm to learn a behavior strategy reaching a target point, as shown in fig. 5, 6, 7 and 9;
step 2.1, designing a reinforcement learning state space, namely designing the state space into a state= { start, goal, current, distobstacle,distgoal and step } according to the actual requirement of path planning and global map information, wherein the start= (xinit,yinit,zinit) represents the starting point position of a redundant driving mechanical arm in the path planning environment, goal = (xend,yend,zend) represents the end point position, current= (xcurrent,ycurrent,zcurrent) represents the current node position, distobstacle represents the distance between the current node and an obstacle, distgoal represents the distance between the current node and the end point, and step represents the running step number;
Designing reinforcement learning action space, defining all possible actions of the redundant driving mechanical arm, wherein the most critical actions relate to directions and step sizes in the process of finding point positions, so that the action space is designed as action= { x, y, z, length }, wherein (x, y, z) represents a direction vector, the range is designed as (-1, 1), and length represents the step sizes of single movement of the redundant driving mechanical arm;
designing a reinforcement learning reward function, wherein the reward function comprises rewarding correct behaviors of the redundant driving mechanical arm and punishment of non-ideal behaviors, and a mechanism for guiding the redundant driving mechanical arm to avoid obstacles to realize path planning gradual learning and optimization in a three-dimensional space;
The design of the reinforcement learning reward function follows these principles: (1) Penalty reward function: to plan a collision-free path safely, a penalty parameter R_collision = −λ_collision is added whenever a collision occurs, where R_collision denotes the collision penalty reward and λ_collision is a constant penalty parameter; if a collision occurs, this penalty reward is assigned to the action the arm executed at that moment, and the penalty term helps the redundantly driven arm learn how to avoid collisions.
(2) Guiding reward function: according to the distance between the redundantly driven arm and the target position, the path planning task is decomposed into several stages from far from the target to close to the target, and each stage uses its own learning objective and reward mechanism:
where d denotes the Euclidean distance between the arm position and the target position, d_1 denotes a first preset Euclidean distance between the arm position and the target position, d_2 a second preset Euclidean distance, λ_1 the reward parameter when the distance d is smaller than d_1, λ_2 the reward parameter when d_1 ≤ d ≤ d_2, λ_3 the reward parameter when d > d_2, and R_1 the staged reward function;
(3) And a basic rewarding function, namely providing positive rewarding for the behavior approaching to the target state, guiding the redundant driving mechanical arm to obtain positive learning rewarding even if the whole task cannot be completed in the learning process, wherein the formula is as follows:
Wherein dcurrent represents the Euclidean distance between the redundant driving mechanical arm and the target position at the current moment, (xcurrent,ycurrent,zcurrent) represents the position of the redundant driving mechanical arm in the three-dimensional space, (xgoal,ygoal,zgoal) represents the target position expected to be reached in the path planning task, alpha1 represents that the adjustment parameter is a constant, and R2 represents a basic rewarding function;
and adding the punishment reward function, the guide reward function and the basic reward function to obtain a complete path planning reward function Rsum=Rcollision+R1+R2.
Step 3, construct an offline reinforcement learning data set D in the reinforcement learning path planning environment. The offline data set includes historical interaction records produced during training, data generated in the simulation simulator, and a manually designed path-planning expert database; the data set D is given the necessary normalization preprocessing to improve its usability;
Step 4, use the normalized and preprocessed data set D to run learning iterations of the offline reinforcement learning algorithm DSAW and output the path nodes n_0, n_1, ..., n_n along which the end of the redundantly driven arm moves;
The specific update process of the offline reinforcement learning algorithm DSAW is as follows: according to the reinforcement learning path planning environment, a dense state advantage weighting function w(s, s') is proposed that assigns a larger weight to desirable policies and a smaller weight to poorer policies:
w(s, s') = 1[A(s, s') ≥ 0] = 1[(Q_θ(s, s') − V_ψ(s)) ≥ 0] = 1[Q_θ(s, s') ≥ V_ψ(s)]
where A(s, s') denotes the state advantage function, Q_θ(s, s') denotes the value function over the current-time and next-time states, V_ψ(s) denotes the evaluation function of the current state, and 1[Q_θ(s, s') ≥ V_ψ(s)] means that w(s, s') equals 1 when Q_θ(s, s') is greater than V_ψ(s) and 0 when Q_θ(s, s') is less than V_ψ(s);
According to the update process driven by the dense state advantage weighting function w(s, s'), the offline reinforcement learning algorithm DSAW first updates the state evaluation function V_δ(s) by minimizing a weighted regression loss.
In this loss, an indicator function and an adjustment parameter k set the weight of each sample, V_δ(s) denotes the state evaluation function, and the loss is the one minimized when the state evaluation function is updated. The weight distribution function emphasizes the cases in which Q_θ'(s, s') exceeds V_δ(s): the DSAW algorithm raises the weight of good states and lowers the weight of bad states during the state evaluation update. Q_θ'(s, s') is produced by the target network θ' of the network parameters θ of the state value function over the current-time and next-time states, while the state evaluation network V_δ(s) is generated from the network parameters δ. The state value function Q_θ(s, s') is then updated:
where E_{s,s'~D} denotes the expectation over state pairs (s, s') drawn from the data set D, and the corresponding loss governs the update of the value function over the current-time and next-time states;
In the updates of Q_θ(s, s') and V_δ(s), only state pairs drawn from the offline data set D are used, which ensures that the state s and the next-time state s' involved in the update lie in the offline data set D and thus avoids interference from extrapolation errors;
A forward network f_φ(s, a), generated from the forward network parameters φ, predicts the next-time state from the state s and action a and is trained by minimizing a prediction loss,
where this loss is the one minimized when the forward network is updated;
An inverse dynamics model I_ω(s, s'), generated from the network parameters ω, is constructed to infer the action a; it is trained under the dense state advantage weighting function w(s, s'),
where I_ω denotes the inverse dynamics network, s and s' denote the current-time and next-time states, and the corresponding loss is the one minimized when the inverse dynamics model is updated;
An additional prediction network s' = τ_ψ(s) is constructed to estimate the next-time state s'; its training iterations keep it close to the distribution of the data set D,
where the corresponding loss is the one minimized when the prediction network is updated.
Step 5, setting a redundant driving mechanical arm joint and an obstacle by adopting a surrounding method, simplifying a mechanical arm connecting rod into a cylinder, simplifying the obstacle into a sphere, calculating the distance between the mechanical arm joint and the obstacle, and performing obstacle avoidance detection and collision prevention test on the redundant driving mechanical arm joint as shown in fig. 3, 4 and 5;
The process of simplifying the redundantly driven arm joints and the obstacles into cylinders and spheres by the bounding-box method is as follows: the bounding-box method converts the collision problem between the arm and an obstacle into a positional-relation problem between a cylinder and a sphere. Before collision detection, the positional relation between the sphere and the cylinder must be judged, namely the relation between the sphere centre P_obj of the sphere obstacle and the foot of perpendicular D on the cylinder axis; the projection of the vector from one end of the cylinder axis to P_obj onto the axis, computed with the dot product, determines the position of the foot D. The steps are:
(1) Compute the vector from one end of the cylinder axis to the sphere centre P_obj and the cylinder-axis vector;
(2) Compute the projection length z of the first vector onto the axis vector, normalized by the axis length;
(3) Judge the position of the foot of perpendicular D from the projection length z: if 0 ≤ z ≤ 1, the foot D lies on the axis segment; otherwise, the foot D lies on the extension of the axis;
If the foot of perpendicular lies on the cylinder axis, the collision distance is the perpendicular distance from the sphere centre to the axis;
if the foot of perpendicular lies on the extension of the cylinder axis, the collision distance is the smaller of the distances from the sphere centre to the two ends of the axis,
where d_obj denotes the distance between the arm joint (link) and the sphere obstacle, min takes the smaller of the two candidate distances, and P_obj denotes the centre coordinates of the sphere obstacle. The collision detection formula is t_collision = 1 if d_obj < r_i + r_obj and t_collision = 0 otherwise,
where d_obj denotes the distance between the arm joint and the sphere obstacle, r_obj the radius of the sphere obstacle, and r_i the smallest cylinder radius of the arm joint; t_collision = 1 means the i-th joint of the redundantly driven arm collides with the obstacle, and t_collision = 0 means it does not.
The specific collection steps for the offline reinforcement learning data set D are as follows: the sampling algorithms for the offline data set include the twin delayed deep deterministic policy gradient algorithm TD3, the deep deterministic policy gradient algorithm DDPG, and an expert data set;
The TD3 algorithm is well suited to continuous action spaces, and its improved policy-update mechanism and value-estimation method make it perform well in complex environments; the DDPG algorithm shows strong robustness and data efficiency during data collection and is particularly suited to application scenarios in which data quality and quantity are limited. The TD3 and DDPG algorithms are applied to the reinforcement learning path planning environment built in Step 2 for training iterations, and the data pairs (s, a, r, s') generated during training are collected, where s denotes the current-time state, a the action, r the reward and s' the next-time state; the expert data set serves as a set of correct behaviour examples that accelerates the learning process.
The data (s, a, r, s') collected in the path planning environment are preprocessed to improve their usability, and the current-time state s is normalized,
where λ denotes the normalized state data, α_2 an adjustment parameter, Q(s_i, s_i') the state value function, s' the next-time state, and N the number of Q(s_i, s_i') terms.
Step 6, establish the model predictive control (MPC) state equation of the redundantly driven arm, perform the inverse-solution angle calculation for the path nodes n_0, n_1, ..., n_n along which the arm's end moves to obtain the redundantly driven joint angle values, and then move the redundantly driven arm according to these joint angle values to complete the whole path planning.
The process of performing the inverse-solution calculation for the path nodes with the state equation of the redundantly driven arm's model predictive control algorithm MPC is as follows:
First, the MPC model of the redundantly driven arm is as follows:
where (x_{k+1}, y_{k+1}, z_{k+1}) denotes the position of the redundantly driven arm at time k+1, (x_k, y_k, z_k) denotes its position at time k, Δθ denotes the angular velocity of each joint, and J_x, J_y, J_z denote the rows of the linear-velocity Jacobian in the x, y and z directions respectively. The state equation of the arm's model predictive control algorithm is therefore
x(k+1) = A x(k) + B u(k) = I [x y z]^T + J [Δθ_1 Δθ_2 Δθ_3 Δθ_4 Δθ_5 Δθ_6 Δθ_7]^T
where A denotes the state transition matrix, B the control input matrix, I a 3×3 identity matrix, [x y z]^T the arm's end position, Δθ_i the increment of joint angle i, x(k) the state variable at time k, x(k+1) the state variable at time k+1, and u(k) the input variable at time k;
The MPC algorithm performs optimal control by predicting over a future time horizon, which in discrete state space is expressed as
X_{k+1} = A X_k + B U_k;
Let u(k+i|k) denote the input at the (k+i)-th time predicted at time k; over the prediction horizon N the stacked input control matrix U_k collects these inputs;
similarly, x(k+i|k) denotes the system state at the (k+i)-th time predicted at time k, and the stacked state matrix X_k collects these states;
Because the inverse solution of the redundantly driven arm is being solved, the position error is E = x_k − x_c, where x_c denotes the target position, and the cost equation is built from this error;
Combining with X_k and U_k, the expression X_{k+1} is rewritten in stacked form,
and the cost matrix at time k can then be expressed in terms of the stacked quantities;
Finally, substituting X_{k+1} into the cost matrix J_{k+1} gives the cost matrix to be solved;
In this process, auxiliary augmented matrices are introduced so that the relation between the loss function J_k and the initial condition x_k, and between J_k and the input U_k, can be seen clearly; the formula for solving the inverse solution of the redundant arm with the MPC algorithm is finally obtained, as shown in fig. 8.
To prevent collisions between the arm joints and obstacles while the MPC algorithm computes the inverse solution, the obstacle-avoidance detection algorithm must be added to the MPC equation as a constraint. The obstacle-avoidance constraint term is constructed as follows:
The position of each current node is obtained from the DH matrix;
in the formula, the DH matrix is used to compute the joint positions of the redundantly driven arm, θ_j denotes the current joint angle value of the arm, and j denotes the joint index; the angle increment, i.e. the input of the MPC formula, is u(k+i) = [Δθ_1 Δθ_2 Δθ_3 Δθ_4 Δθ_5 Δθ_6 Δθ_7]^T, and o_j = (x_j, y_j, z_j) denotes the corresponding position of the arm in three-dimensional space. Collision is prevented by constraining the distance d_obj(o_j) between the links of the arm and the obstacle to be larger than the collision distance:
where d_obj denotes the distance calculation formula; if the obtained distance is larger than r_i + r_obj, no collision is judged, and if it is smaller than r_i + r_obj, a collision is judged.
The invention discloses a path planning method for a redundantly driven robotic arm based on the Dense State Advantage Weighting (DSAW) offline reinforcement learning algorithm. The DSAW algorithm applies a dense state advantage weighting function to the State Advantage Weighting (SAW) offline reinforcement learning algorithm as a weight-assignment constraint, which raises the probability that the policy favours high-quality data and reduces the negative influence of low-quality data. For the collision detection problem in three-dimensional space, the collision relation between a path node and an obstacle is simplified, via the bounding method, into the collision relation between a sphere and a cuboid, so that collisions are avoided. For the design of the reinforcement learning path planning environment, a basic reward function, a collision penalty reward function, a target reward function and the like are designed from the aspects of position, collision constraints and step-size limits, guiding the redundantly driven arm to learn a behaviour policy that reaches the target point. A Model Predictive Control (MPC) algorithm then performs the inverse-solution calculation for the nodes on the path to obtain the corresponding joint angles; considering the presence of obstacles and the fact that a redundantly driven arm has infinitely many inverse solutions, distance-error, angle-increment and obstacle-avoidance constraints are added to the model predictive control algorithm, so that the optimal joint angles are solved while the arm does not collide with the obstacles. Compared with classical offline reinforcement learning algorithms, the method shows significant improvements in exploration steps, iteration count and path length, and the MPC algorithm improves the efficiency of joint-angle solving while guaranteeing optimal angles.

Claims (6)

Translated from Chinese
1.一种基于DSAW离线强化学习算法的冗余驱动机械臂路径规划方法,其特征在于,包括如下步骤:1. A redundant drive robot arm path planning method based on DSAW offline reinforcement learning algorithm, characterized in that it includes the following steps:步骤1:进行冗余驱动机械臂的强化学习路径规划的前期准备,全局地图初始化:在计算机中对冗余驱动机械臂移动场景进行环境建模并建立三维坐标系,采用包围法将冗余驱动机械臂移动场景环境中的所有障碍物简化为长方体和球体,以方便后续障碍物碰撞检测,防止出现碰撞问题;设障碍物数量为k,组成障碍物集合为{obji},简化为球体的障碍物包括球半径及球心坐标信息,简化为长方形的障碍物包括各方向轴上的最小坐标和最大坐标信息,即Step 1: Perform preliminary preparation for the reinforcement learning path planning of the redundant driven robot arm and initialize the global map: Model the environment of the redundant driven robot arm moving scene in the computer and establish a three-dimensional coordinate system. Use the encirclement method to simplify all obstacles in the redundant driven robot arm moving scene environment into cuboids and spheres to facilitate subsequent obstacle collision detection and prevent collision problems; let the number of obstacles be k, and the obstacle set be {obji }. The obstacles simplified to spheres include the sphere radius and sphere center coordinate information, and the obstacles simplified to rectangles include the minimum and maximum coordinate information on each direction axis, that is,步骤2:在仿真模拟器CoppeliaSim中构建关于冗余驱动机械臂的强化学习路径规划环境,路径规划环境包括了强化学习状态空间、强化学习动作空间和强化学习的奖励函数,来引导冗余驱动机械臂学习达到目标点的行为策略;Step 2: Construct a reinforcement learning path planning environment for the redundant driving robot in the simulation simulator CoppeliaSim. The path planning environment includes the reinforcement learning state space, reinforcement learning action space, and reinforcement learning reward function to guide the redundant driving robot to learn the behavior strategy to reach the target point.步骤2.1设计强化学习状态空间;依据路径规划的实际需求以及全局地图信息将状态空间设计为state={start,goal,current,distobstacle,distgoal,step},其中start=(xinit,yinit,zinit)表示路径规划环境中冗余驱动机械臂的起点位置,goal=(xend,yend,zend)表示终点位置,current=(xcurrent,ycurrent,zcurrent),表示冗余驱动机械臂的当前节点位置,distobstacle表示当前节点与障碍物之间的距离,distgoal表示当前节点与终点之间的距离,step表示运行步数;Step 2.1 Design the reinforcement learning state space; according to the actual needs of path planning and the global map information, the state space is designed as state = {start, goal, current, distobstacle , distgoal , step}, where start = (xinit , yinit , zinit ) represents the starting position of the redundant drive robot in the path planning environment, goal = (xend , yend , zend ) represents the end position, current = (xcurrent , ycurrent , zcurrent ), represents the current node position of the redundant drive robot, distobstacle represents the distance between the current node and the obstacle, distgoal represents the distance between the current node and the end point, and step represents the number of running steps;步骤2.2:设计强化学习动作空间,定义冗余驱动机械臂所有可能执行的动作,其中最关键的动作涉及到寻找点位过程中的方向和步长,因此将动作空间设计为action={x,y,z,length},其中(x,y,z)表示方向向量,范围设计为(-1,1),length表示冗余驱动机械臂单次移动的步长;Step 2.2: Design the reinforcement learning action space and define all possible actions that the redundant driving robot can perform. The most critical actions involve the direction and step length in the process of finding the point. 
Therefore, the action space is designed as action = {x, y, z, length}, where (x, y, z) represents the direction vector, the range is designed to be (-1, 1), and length represents the step length of a single movement of the redundant driving robot.步骤2.3:设计强化学习的奖励函数;奖励函数包括对冗余驱动机械臂正确行为的奖励、不理想行为的惩罚,以及引导冗余驱动机械臂避开障碍物实现在三维空间中进行路径规划逐步学习和优化的机制;Step 2.3: Design a reward function for reinforcement learning; the reward function includes rewards for correct behavior of the redundantly driven robot arm, penalties for undesirable behavior, and a mechanism to guide the redundantly driven robot arm to avoid obstacles and achieve path planning in three-dimensional space. Learning and optimization;步骤3:在强化学习路径规划环境中搭建离线强化学习数据集离线强化学习数据集包括训练过程中产生的历史交互记录、利用仿真模拟器中生成的数据以及人工涉及好的路径规划专家数据库;对数据集进行必要的归一化预处理,以提高其可用性;Step 3: Build an offline reinforcement learning dataset in a reinforcement learning path planning environment The offline reinforcement learning dataset includes historical interaction records generated during the training process, data generated by the simulation simulator, and an artificial path planning expert database. Perform necessary normalization preprocessing to improve its usability;步骤4:利用归一化预处理后的数据集对离线强化学习算法DSAW进行学习迭代,输出得到冗余驱动机械臂末端移动过程中的路径节点n0,n1,...,nnStep 4: Use normalized preprocessed dataset The offline reinforcement learning algorithm DSAW is iterated to output the path nodes n0 ,n1 ,...,nn during the movement of the redundant driving robot end;离线强化学习算法DSAW算法的具体更新过程为:The specific update process of the offline reinforcement learning algorithm DSAW algorithm is:依据强化学习路径规划环境,提出密集状态优势加权函数w(s,s′),给理想的策略分配更大的权重,给较差的策略更小的权重:According to the reinforcement learning path planning environment, a dense state advantage weighting function w(s,s′) is proposed to assign a larger weight to the ideal strategy and a smaller weight to the poor strategy:w(s,s')=[A(s,s')≥0]=[(Qθ(s,s')-Vψ(s))≥0]=1[Qθ(s,s')≥Vψ(s)]w(s,s')=[A(s,s')≥0]=[(Qθ (s,s')-Vψ (s))≥0]=1 [Qθ (s,s')≥Vψ (s)]式中,A(s,s')表示状态优势函数,Qθ(s,s')和Vψ(s)分别表示当前时刻状态和下一时刻状态值函数和对当前状态的评估函数;1[Qθ(s,s')≥Vψ(s)]表示在Q(s,s′)大于V(s)时w(s,s′)值为1,在Q(s,s′)小于V(s)时w(s,s′)值为0;Where A(s,s') represents the state advantage function, Qθ (s,s') and Vψ (s) represent the state value function of the current state and the next state and the evaluation function of the current state, respectively;1 [Qθ (s,s') ≥ Vψ (s)] means that when Q(s,s') is greater than V(s), the value of w(s,s') is 1, and when Q(s,s') is less than V(s), the value of w(s,s') is 0;依据密集状态优势加权函数w(s,s′),离线强化学习算法DSAW的更新过程,首先对状态评估函数Vδ(s)进行更新:According to the dense state advantage weighting function w(s,s′), the updating process of the offline reinforcement learning algorithm DSAW first updates the state evaluation function Vδ (s):式中,表示指示器函数,k表示调整参数,Vδ(s)表示状态评估函数,表示状态评估函数进行更新时的损失函数;权重分配函数用以强调Qθ'(s,s')大于Vδ(s)值的时刻,提高DSAW算法在进行状态评估函数更新时对优质状态的权重,降低劣质状态的权重,而Qθ'(s,s')是由当前时刻状态和下一时刻状态值函数的网络参数θ的目标网络θ′生成,而状态评估网络Vδ(s)由网络参数δ生成;然后对状态值函数Qθ(s,s')进行更新:In the formula, represents the indicator function, k represents the adjustment parameter, Vδ (s) represents the state evaluation function, Represents the loss function when the state evaluation function is updated; weight distribution function It is used to emphasize the moment when Qθ' (s, s') is greater than the value of Vδ (s), to increase the weight of the high-quality state and reduce the weight of the low-quality state when the DSAW algorithm updates the state evaluation function. 
Step 5: Set up the redundantly driven arm joints and the obstacles with the bounding-volume method: simplify the arm links into cylinders and the obstacles into spheres, then compute the distances between the arm joints and the obstacles so as to perform obstacle-avoidance detection and collision-prevention tests for the joints.

Step 6: Establish the state equation of the model predictive control algorithm MPC for the redundantly driven arm, perform inverse-solution angle computation for the path nodes n0, n1, ..., nn traversed by the end-effector, and obtain the redundant joint angle values; the arm then moves according to these joint angle values, completing the whole path planning.
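To connect Steps 4 to 6, the following sketch shows how the learned path nodes might be turned into joint commands: a placeholder policy proposes waypoints, a bounding-volume collision check filters them (Step 5), and an MPC-style inverse solution converts each node into joint angles (Step 6). The names policy, collision_free, mpc_inverse_solution and arm.move_to are all hypothetical.

```python
import math

def plan_and_execute(start, goal, policy, collision_free, mpc_inverse_solution, arm,
                     max_nodes=100, tol=1e-2):
    """Generate path nodes with a learned policy, then convert each node to joint angles."""
    nodes = [start]
    current = start
    for _ in range(max_nodes):
        nxt = policy(current, goal)              # next waypoint proposed by the learned policy
        if not collision_free(nxt):              # Step 5: bounding-volume collision check
            break                                # stop (or re-plan) if the waypoint collides
        nodes.append(nxt)
        current = nxt
        if math.dist(current, goal) < tol:       # close enough to the target point
            break
    joint_angles = [mpc_inverse_solution(n) for n in nodes]   # Step 6: MPC inverse solution
    for angles in joint_angles:
        arm.move_to(angles)                      # execute on the (simulated) arm
    return nodes, joint_angles
```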
2. The path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm according to claim 1, characterized in that the specific procedure for simplifying the obstacles with the bounding-volume method is as follows:

The redundantly driven arm is treated as a small ball of radius r. For a spherical obstacle, on the premise that the two are not allowed to collide, the collision distance between them is the Euclidean distance between the current position (x_current, y_current, z_current) of the arm and the obstacle centre,

d_i = sqrt((x_current - x_i_obj)^2 + (y_current - y_i_obj)^2 + (z_current - z_i_obj)^2),

where (x_i_obj, y_i_obj, z_i_obj) are the centre coordinates of the i-th spherical obstacle. Collision detection is performed as

t_collision = 1 if d_i ≤ r + r_i_obj, otherwise t_collision = 0,

where r and r_i_obj denote the radius of the arm's bounding sphere and the radius of the spherical obstacle respectively; t_collision = 1 means the arm has collided with the i-th obstacle, and t_collision = 0 means no collision has occurred with the i-th obstacle.

For a cuboid obstacle, whether a collision occurs between the two is judged as follows. Irregular obstacles are simplified with the AABB bounding method so that every face of the simplified cuboid is parallel to a coordinate axis. Let the minimum coordinate position on the cuboid be (x_min, y_min, z_min) and the maximum coordinate be (x_max, y_max, z_max), let the sphere that simplifies the arm have centre (x_current, y_current, z_current) and radius r, and let the nearest point be (x, y, z). The nearest point of the cuboid to the sphere centre is computed; if the centre satisfies

x_min ≤ x_current ≤ x_max, y_min ≤ y_current ≤ y_max and z_min ≤ z_current ≤ z_max,

then the nearest point (x, y, z) is the sphere centre (x_current, y_current, z_current) itself and a collision occurs. When the nearest point is not inside the cuboid, the relation of the centre (x_current, y_current, z_current) to the ranges (x_min, x_max), (y_min, y_max), (z_min, z_max) is judged: if x_current < x_min the nearest point has coordinate x_min, and if x_current > x_max it has coordinate x_max; the nearest coordinates y and z are obtained in the same way.

When the nearest point is not inside the cuboid, the Euclidean distance from the nearest point to the sphere centre is computed as

d = sqrt((x - x_current)^2 + (y - y_current)^2 + (z - z_current)^2),

and if d ≤ r it is judged that the redundantly driven arm collides with the obstacle.
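A minimal sketch of the two collision tests of claim 2, assuming the arm node is a sphere of radius r: a sphere-versus-sphere test and a sphere-versus-AABB test that clamps the sphere centre to the box to find the nearest point.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]

def sphere_sphere_collision(center: Vec3, r: float, obs_center: Vec3, obs_r: float) -> bool:
    """Collision if the centre distance does not exceed the sum of the radii."""
    return math.dist(center, obs_center) <= r + obs_r

def sphere_aabb_collision(center: Vec3, r: float, box_min: Vec3, box_max: Vec3) -> bool:
    """Clamp the sphere centre to the box to obtain the nearest point, then compare
    the nearest-point distance with the sphere radius."""
    nearest = tuple(min(max(c, lo), hi) for c, lo, hi in zip(center, box_min, box_max))
    return math.dist(center, nearest) <= r
```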
3. The path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm according to claim 1, characterized in that the design of the reinforcement-learning reward function follows the following principles:

(1) Penalty reward function: in order to plan a collision-free path safely, a penalty parameter is applied when a collision occurs, R_collision = -λ_collision, where R_collision denotes the collision penalty reward and λ_collision is a constant penalty parameter; if a collision occurs, this penalty is assigned to the action the arm is executing at that moment, and the penalty term helps the arm learn how to avoid collisions.

(2) Guided reward function: according to the distance between the arm and the target position, the path planning task is decomposed into several stages from far away from the target to close to the target, and each stage uses a unified or an independent learning goal and reward mechanism as required. With an independent learning goal and reward mechanism for each stage, the stage reward is

R1 = λ1 when d < d1, R1 = λ2 when d1 ≤ d ≤ d2, and R1 = λ3 when d > d2,

where d is the actual Euclidean distance between the arm position and the target position, d1 is the first preset Euclidean distance between the arm position and the target position, d2 is the second preset Euclidean distance, λ1 is the reward parameter when d is smaller than d1, λ2 is the reward parameter when d is between d1 and d2, λ3 is the reward parameter when d is greater than d2, and R1 denotes the stage reward function.

(3) Basic reward function: a positive reward is provided for behaviour that approaches the target state, so that the arm obtains positive learning feedback during learning even before the whole task is completed. The reward is computed from the current distance to the goal

d_current = sqrt((x_current - x_goal)^2 + (y_current - y_goal)^2 + (z_current - z_goal)^2),

where d_current is the Euclidean distance between the arm and the target position at the current moment, (x_current, y_current, z_current) is the current node position of the arm in three-dimensional space, (x_goal, y_goal, z_goal) is the target position expected to be reached in the path planning task, α1 is a constant adjustment parameter, and R2 denotes the basic reward function.

Adding the penalty reward function, the guided reward function and the basic reward function gives the complete path planning reward function R_sum = R_collision + R1 + R2.
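A sketch of the composite reward R_sum = R_collision + R1 + R2 of claim 3. The threshold values d1 and d2, the parameters λ and α1, and in particular the concrete form chosen for the basic reward R2 (here α1 / (1 + d), so that it grows as the arm approaches the goal) are illustrative assumptions, since the claim describes R2 only qualitatively.

```python
import math

def path_planning_reward(current, goal, collided,
                         d1=0.1, d2=0.5,
                         lam_collision=10.0, lam1=5.0, lam2=1.0, lam3=-0.5,
                         alpha1=1.0):
    """R_sum = R_collision + R1 + R2 (all numeric values are illustrative)."""
    d = math.dist(current, goal)

    # (1) penalty reward: constant penalty whenever a collision occurs
    r_collision = -lam_collision if collided else 0.0

    # (2) guided (stage) reward, piecewise in the distance to the goal
    if d < d1:
        r1 = lam1
    elif d <= d2:
        r1 = lam2
    else:
        r1 = lam3

    # (3) basic reward: assumed shape, larger when the arm is closer to the goal
    r2 = alpha1 / (1.0 + d)

    return r_collision + r1 + r2
```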
4. The path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm according to claim 1, characterized in that the specific steps for collecting the offline reinforcement-learning dataset are as follows:

The sampling algorithms for the offline dataset comprise the twin delayed deep deterministic policy gradient algorithm TD3, the deep deterministic policy gradient algorithm DDPG, and an expert dataset.

The TD3 algorithm is suited to problems with continuous action spaces; its improved policy-update mechanism and value-estimation method make it perform well in complex environments. The DDPG algorithm shows strong robustness and data efficiency during data collection and is suited to application scenarios with restrictions on data quality and quantity. The TD3 and DDPG algorithms are applied to the reinforcement-learning path planning environment constructed in step 2 and trained iteratively, and the data tuples (s, a, r, s') produced during training are collected, where s denotes the current state, a the action, r the reward and s' the state at the next moment. The expert dataset is used as examples of correct behaviour to accelerate the learning process.

The data (s, a, r, s') collected in the path planning environment are preprocessed to improve their usability, and the current state s is normalised, where λ denotes the normalised state data, α2 the adjustment parameter, Q(s_i, s_i') the state value function, s' the state at the next moment, and N the number of Q(s_i, s_i') terms.
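An illustrative collection loop for the offline dataset of claim 4, assuming a gym-style environment interface; in the claimed method the behaviour-policy role is played by TD3/DDPG actors and the expert database, which are represented here by a generic policy callable.

```python
def collect_offline_dataset(env, policy, episodes=100, max_steps=200):
    """Roll out a behaviour policy and store (s, a, r, s') tuples for offline training."""
    dataset = []
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = policy(s)                      # e.g. a TD3/DDPG actor or an expert planner
            s_next, r, done, _ = env.step(a)   # assumes a gym-like 4-tuple step interface
            dataset.append((s, a, r, s_next))  # one offline transition
            s = s_next
            if done:
                break
    return dataset
```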
5. The path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm according to claim 1, characterized in that the process of simplifying the arm joints and the obstacles into cylinders and spheres with the bounding-volume method is as follows:

Through the bounding-box method, the collision problem between the arm and an obstacle is converted into the positional relationship between a cylinder and a sphere. Before collision detection, the positional relationship between the sphere and the cylinder must be determined, that is, the relationship between the sphere centre P_obj of the spherical obstacle and the foot of the perpendicular D on the cylinder axis; this relationship is computed with the dot product of vectors, in the following steps:

(1) compute the vector along the cylinder axis and the vector from the start of the axis to the sphere centre P_obj;

(2) compute the projection length z of the vector towards the sphere centre onto the axis vector;

(3) judge the position of the foot of the perpendicular D from the projection length z: if 0 ≤ z ≤ 1, the foot D lies on the axis segment; otherwise the foot D lies on the extension of the segment.

If the foot of the perpendicular lies on the cylinder axis segment, the collision distance d_obj is computed as the perpendicular distance from the sphere centre to the axis; if the foot of the perpendicular lies on the extension of the cylinder axis, the collision distance is computed with respect to the segment endpoints, where d_obj denotes the distance between the arm joint and the spherical obstacle, min denotes taking the smaller value of the two formulas, (x_obj, y_obj, z_obj) denotes the centre coordinates of the spherical obstacle, and the segment endpoint denotes the end position of the i-th joint. Collision detection is then performed as

t_collision = 1 if d_obj ≤ r_obj + r_i, otherwise t_collision = 0,

where d_obj is the distance between the arm joint and the spherical obstacle, r_obj is the radius of the spherical obstacle, and r_i is the minimum cylinder radius of the arm joint; t_collision = 1 means the i-th joint of the arm has collided with the obstacle, and t_collision = 0 means the i-th joint has not collided with the obstacle.
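A sketch of the cylinder-versus-sphere (capsule) test of claim 5: the obstacle centre is projected onto the link axis, the projection is clamped to the segment (which reproduces the on-segment and endpoint cases of the claim), and the resulting distance is compared with the sum of the radii. The link is assumed to have nonzero length.

```python
import numpy as np

def link_sphere_collision(p_start, p_end, link_radius, obs_center, obs_radius):
    """Capsule-vs-sphere test for one arm link."""
    p_start, p_end, obs_center = map(np.asarray, (p_start, p_end, obs_center))
    axis = p_end - p_start                             # vector along the cylinder axis
    to_center = obs_center - p_start                   # vector from axis start to sphere centre
    z = np.dot(to_center, axis) / np.dot(axis, axis)   # normalised projection length
    z = np.clip(z, 0.0, 1.0)                           # foot outside the segment -> nearer endpoint
    foot = p_start + z * axis                          # nearest point on the segment
    d_obj = np.linalg.norm(obs_center - foot)          # distance between link and obstacle
    return d_obj <= link_radius + obs_radius           # t_collision
```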
6. The path planning method for a redundantly driven robotic arm based on the DSAW offline reinforcement learning algorithm according to claim 5, characterized in that the process by which the state equation of the model predictive control algorithm MPC of the redundantly driven arm computes the path nodes is as follows:

First, in the MPC model of the redundantly driven arm, (x_{k+1}, y_{k+1}, z_{k+1}) denotes the arm position at time k+1, (x_k, y_k, z_k) denotes the arm position at time k, Δθ denotes the angular increments of the joints, and J_x, J_y, J_z denote the Jacobian linear-velocity matrices in the x, y and z directions; the state equation of the model predictive control algorithm of the redundantly driven arm is therefore

x(k+1) = A x(k) + B u(k) = I [x y z]^T + J [Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]^T,

where A is the state transition matrix, B the control input matrix, I the 3×3 identity matrix, [x y z] the end-effector position, Δθi the increment of each joint angle, x(k) the state variable at time k, x(k+1) the state variable at time k+1, and u(k) the control input variable at time k.

The model predictive control algorithm MPC optimises the control through the behaviour over a future time window and is expressed in discrete state space as

X_{k+1} = A X_k + B U_k.

Let u(k+i|k) denote the input value at time k+i predicted at time k; within the prediction horizon N, the input control matrix U_k stacks these predicted inputs, and similarly x(k+i|k) denotes the system state at time k+i predicted at time k, so that the predicted states are stacked in the same way.

Since the inverse solution of the redundantly driven arm is being solved, the position error is E = x_k - x_c, where x_c denotes the target position, and the cost equation is defined on this error. Combining the formulas for X_k and U_k, X_{k+1} is written in terms of the initial state and the stacked inputs, the cost matrix J_k at time k is expressed accordingly, and finally X_{k+1} is substituted into the cost matrix J_{k+1} to obtain the cost matrix that has to be solved. Writing the result with an augmented matrix makes the relationship between the loss function J_k, the initial condition x_k and the input U_k clearly visible, and the formula for solving the inverse solution of the redundant arm with the MPC algorithm is thereby obtained.

In order to prevent the arm joints from colliding with obstacles while the MPC algorithm performs the inverse-solution computation, the obstacle-avoidance detection algorithm is added to the MPC equations as a constraint. The specific steps of the obstacle-avoidance constraint term are as follows: the current position of each joint point is obtained through the DH matrix, where the DH matrix is used to compute the joint positions of the redundantly driven arm, θ_j denotes the current joint angle value of the arm with j the joint index, the angle increments, i.e. the input values of the MPC formula, are u(k+i) = [Δθ1 Δθ2 Δθ3 Δθ4 Δθ5 Δθ6 Δθ7]^T, and o_j = (x_j, y_j, z_j) denotes the position of the arm end in three-dimensional space. Collision is prevented by constraining the distance d_obj(o_j) between each link of the arm and the obstacle to be greater than the collision distance, where d_obj denotes the distance calculation formula; if the obtained distance is greater than r_i + r_obj it is judged that no collision has occurred, and if it is smaller than r_i + r_obj it is judged that a collision has occurred.
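A single-step sketch of the Jacobian-based inverse solution underlying the MPC state equation of claim 6, using a damped least-squares pseudoinverse; the full receding-horizon cost minimisation and the obstacle-avoidance constraint are not reproduced here. The Jacobian is assumed to be supplied as a 3×7 matrix stacking J_x, J_y and J_z.

```python
import numpy as np

def ik_step(x_current, x_target, jacobian, damping=1e-2, max_step=0.05):
    """One damped least-squares step: delta_theta = J^T (J J^T + lambda I)^-1 (x_target - x_current)."""
    J = np.asarray(jacobian)                                  # 3 x 7 linear-velocity Jacobian
    error = np.asarray(x_target) - np.asarray(x_current)      # position error E
    JJt = J @ J.T + damping * np.eye(3)                       # damping regularises near singularities
    delta_theta = J.T @ np.linalg.solve(JJt, error)
    # limit the joint increment so that x(k+1) = x(k) + J * delta_theta remains a small step
    norm = np.linalg.norm(delta_theta)
    if norm > max_step:
        delta_theta *= max_step / norm
    return delta_theta
```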
CN202410807695.8A | Priority date: 2024-06-21 | Filing date: 2024-06-21 | Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm | Status: Active | CN118700133B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202410807695.8A | 2024-06-21 | 2024-06-21 | Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202410807695.8A | 2024-06-21 | 2024-06-21 | Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm

Publications (2)

Publication Number | Publication Date
CN118700133A (en) | 2024-09-27
CN118700133B (en) | 2025-02-07

Family

ID=92819437

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410807695.8A (Active, CN118700133B (en)) | Path planning method for redundantly driven robotic arm based on DSAW offline reinforcement learning algorithm | 2024-06-21 | 2024-06-21

Country Status (1)

Country | Link
CN (1) | CN118700133B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN119458357B (en)* | 2024-12-12 | 2025-09-26 | China University of Mining and Technology | Mechanical arm inverse kinematics solving and track planning method based on digital twin technology
CN119748459B (en)* | 2025-02-21 | 2025-07-25 | Nanjing Tech University | Trajectory tracking method of mobile manipulator with variable load based on reinforcement learning and model predictive control

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115890670A (en)* | 2022-11-19 | 2023-04-04 | Wuxi Huiyan Artificial Intelligence Technology Co., Ltd. | Method for training motion trail of seven-degree-of-freedom redundant mechanical arm based on intensive deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
KR102211012B1 (en)* | 2016-09-15 | 2021-02-03 | Google LLC | Deep reinforcement learning for robot operation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115890670A (en)* | 2022-11-19 | 2023-04-04 | Wuxi Huiyan Artificial Intelligence Technology Co., Ltd. | Method for training motion trail of seven-degree-of-freedom redundant mechanical arm based on intensive deep learning

Also Published As

Publication number | Publication date
CN118700133A (en) | 2024-09-27


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
