Background
Artificial intelligence algorithms are now widely applied to robot control, and robot control algorithms are gradually shifting from equation solving to data-driven methods. This design adopts the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient) to replace the forward (inverse) kinematics solution used in traditional mechanical arm control, obtaining a neural network model directly through data-driven training to control the end of the mechanical arm to reach a target position. With this method, the trained model can be rapidly deployed on the mechanical arm control platform, so that the arm can move quickly to any given target position point. The mechanical arm is trained with the DDPG algorithm in a simulation environment using a two-stage scheme, first 2D modeling and then 3D modeling, which greatly shortens the training time. Finally, the trained algorithm model is implemented and verified on a real mechanical arm, and its control effect meets the application requirements.
A deep reinforcement learning algorithm involves five major elements: Agent, Environment, Action, State, and Reward. As shown in fig. 1, the agent interacts with the environment in real time: after observing a state, the agent outputs an action according to a policy model; the action acts on the environment and changes the state; the environment then gives the agent a reward based on the action and the state; and the agent updates its action-selection policy model according to the state, action, and reward. By repeatedly trying in the environment so as to maximize the reward, the agent learns the mapping from state to action, namely the policy model (or simply the model), which is represented by a parameterized deep neural network.
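As an illustration of this interaction loop, the following minimal Python sketch runs one episode with a hypothetical placeholder environment and policy (these stand-ins are not the invention's actual model):

```python
import numpy as np

class DummyEnv:
    """Hypothetical stand-in for the mechanical-arm environment."""
    def reset(self):
        return np.zeros(4)                         # initial state
    def step(self, action):
        next_state = np.random.randn(4)            # state after the action
        reward = -np.linalg.norm(next_state[:2])   # e.g. negative distance to the target
        done = reward > -0.05                      # target reached
        return next_state, reward, done

def policy(state):
    """Placeholder for the parameterized policy model pi_theta(s)."""
    return np.tanh(np.random.randn(2))             # action in [-1, 1]

env = DummyEnv()
state = env.reset()
for t in range(200):                               # one episode of at most 200 steps
    action = policy(state)                         # agent outputs an action by its policy
    state, reward, done = env.step(action)         # environment returns new state and reward
    if done:                                       # target reached: episode ends
        break
```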
The DDPG algorithm is already widely used in the intelligent control of the mechanical arm, but the following difficulties still exist in implementation:
1. How the data-driven deep reinforcement learning algorithm should acquire training data through interaction between the simulated mechanical arm and the virtual environment, so that an effective control model is obtained.
2. For the training process of the mechanical arm, how to set the state parameters of the mechanical arm and the environment, and how to set the reward function of the training process, so that the control effect of the trained mechanical arm is the best.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a rapid training method for intelligent control of a mechanical arm based on deep reinforcement learning; the training time of the method is short, and the obtained model has a good control effect.
The technical scheme adopted by the invention is as follows: a rapid training method for intelligent control of a mechanical arm based on deep reinforcement learning, comprising the following steps:
S1, training a 2D mechanical arm with the deep reinforcement learning algorithm DDPG in a 2D mechanical arm simulation environment without physical attributes, and finding the optimal state vector representation and the optimal reward function form;
S2, training a 3D mechanical arm with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes, reusing the optimal state vector representation and optimal reward function form found in the 2D simulation environment, so as to obtain a control strategy model;
S3, deploying the control strategy model obtained by training in the 3D mechanical arm simulation environment with physical attributes onto the real mechanical arm.
The 2D mechanical arm comprises an axis a, an axis b, and a tip c; the lengths of rod ab and rod bc are both L; axis a is a fixed rotary joint, axis b is a movable rotary joint, and c is the end of the mechanical arm; the included angle between rod ab and the horizontal line is ∠θ, and the included angle between rod bc and the horizontal line is ∠α.
The optimal state vector obtained in step S1 is:
s = [c_x, c_y, |c_x − x|, |c_y − y|]
where c_x represents the x-axis coordinate of the mechanical arm end c, c_y represents the y-axis coordinate of the mechanical arm end c, |c_x − x| represents the x-axis distance of the mechanical arm end c from the target point, |c_y − y| represents the y-axis distance of the mechanical arm end c from the target point, and (x, y) represents any given target position point on the 2D plane.
The optimal reward function obtained in step S1 is:
r = −√((c_x − x)² + (c_y − y)²)
namely the negative of the straight-line distance between the mechanical arm end c and the target point (x, y).
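For concreteness, a minimal sketch of how this state vector and reward could be computed in Python is shown below; the function and variable names are illustrative, not taken from the invention:

```python
import math

def state_and_reward(c_x, c_y, x, y):
    """Build the 4-dimensional state vector and the reward for the 2D arm.

    (c_x, c_y): current position of the arm end c
    (x, y):     given target point on the 2D plane
    """
    state = [c_x, c_y, abs(c_x - x), abs(c_y - y)]
    reward = -math.hypot(c_x - x, c_y - y)   # negative straight-line distance to the target
    return state, reward

# Example: end at (210, 190), target at (250, 250)
s, r = state_and_reward(210.0, 190.0, 250.0, 250.0)
print(s, r)
```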
the DDPG includes four neural networks, respectively: the target network and the evaluation network of the Actor and the target network and the evaluation network of the Critic are the same in structure, and the target network and the evaluation network of the Critic are the same in structure.
The mean square error loss function
J(ω) = (1/m) Σ_{i=1..m} (y_i − Q(s_i, a_i, ω))²
is used to update the parameters of the Critic's evaluation network by gradient back-propagation of the neural network, where m represents the number of samples of the batch gradient descent, y_i represents the target Q value given by the Critic's target network for the i-th sample, ω represents the parameters of the Critic's evaluation network, s_i represents the state in the i-th sample, and a_i represents the action in the i-th sample.
The loss function
J(θ) = −(1/m) Σ_{i=1..m} Q(s_i, a_i, ω), with a_i = π_θ(s_i),
is used to update the parameters θ of the Actor's evaluation network by gradient back-propagation of the neural network, where m represents the number of samples of the batch gradient descent, ω represents the parameters of the Critic's evaluation network, s_i represents the state in the i-th sample, and a_i represents the action output by the Actor's evaluation network for the i-th sample.
If the iteration count t satisfies t % C == 1 (i.e. every C iterations), the parameters of the Actor target network are updated by θ' ← τθ + (1 − τ)θ', and the parameters of the Critic target network are updated by ω' ← τω + (1 − τ)ω';
where C represents the update period (in steps) of the target network parameters, T denotes the maximum number of iterations, θ denotes the parameters of the Actor's evaluation network, θ' denotes the parameters of the Actor's target network, ω denotes the parameters of the Critic's evaluation network, ω' denotes the parameters of the Critic's target network, ← denotes assigning the result on the right to the variable on the left, and τ denotes the soft-update weight coefficient.
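A minimal PyTorch sketch of these two updates and the soft target-network update is given below; the network objects, optimizer settings, and hyperparameter values are illustrative assumptions, not the invention's actual implementation:

```python
import torch
import torch.nn as nn

def soft_update(target_net, eval_net, tau):
    """theta' <- tau * theta + (1 - tau) * theta' (likewise omega' for the Critic)."""
    with torch.no_grad():
        for p_t, p_e in zip(target_net.parameters(), eval_net.parameters()):
            p_t.copy_(tau * p_e + (1.0 - tau) * p_t)

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.01):
    # batch: tensors (s, a, r, s_next, done); r and done have shape (m, 1)
    s, a, r, s_next, done = batch

    # Critic update: minimize mean-squared error against the target Q value y
    with torch.no_grad():
        a_next = actor_t(s_next)
        y = r + gamma * (1.0 - done) * critic_t(s_next, a_next)
    q = critic(s, a)
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(s, pi_theta(s)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of both target networks
    # (the invention applies these every C iterations; here they run on every call)
    soft_update(actor_t, actor, tau)
    soft_update(critic_t, critic, tau)
```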
The invention has the following beneficial effects: the method trains with a deep reinforcement learning algorithm in a 2D mechanical arm simulation environment without physical attributes, which greatly reduces the training complexity, greatly shortens the training time, and accelerates the training of the control strategy model of the mechanical arm. Meanwhile, the optimal state vector representation and the optimal reward function form are found through training in the 2D simulation environment, so that the convergence speed and stability of the trained control strategy model are optimal, and the end of the mechanical arm can be controlled to quickly reach the target position.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
The deep reinforcement learning algorithm used in the invention is the Deep Deterministic Policy Gradient (DDPG) algorithm. It combines the policy network of the Deterministic Policy Gradient (DPG) algorithm with the Actor-Critic framework, and adopts the experience replay and the separation of a target network from an evaluation network used in Deep Q-Networks (DQN, Deep Q-Network); it achieves good results in environments with continuous action spaces. DDPG contains four neural networks, namely the target network and evaluation network of the Actor and the target network and evaluation network of the Critic; the two Actor networks have exactly the same structure, and the two Critic networks have exactly the same structure.
Fig. 2 is a flowchart of the DDPG algorithm. The Actor evaluation network outputs the current action according to the current state; the Actor target network outputs the next action according to the next state; the Critic evaluation network outputs the current Q value according to the current state and the current action; and the Critic target network outputs the target Q value according to the next state and the next action. The Actor evaluation network updates itself according to the current Q value, while the Critic evaluation network updates itself according to the current Q value, the target Q value, and the reward. At regular intervals, the parameters of the evaluation networks are copied to the corresponding target networks in a weighted-average (soft update) manner.
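As a concrete but hypothetical illustration of these four networks, the sketch below defines simple multilayer-perceptron Actor and Critic networks in PyTorch; the layer sizes and dimensions are assumptions, not the values used by the invention:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action (e.g. joint-angle changes)."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())
        self.max_action = max_action
    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Evaluation networks and structurally identical target networks
actor, critic = Actor(4, 2), Critic(4, 2)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```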
The parameters in FIG. 2 are explained in Table 1:
TABLE 1 Meanings of the parameters in FIG. 2
| Parameter name | Meaning |
| --- | --- |
| S | Current state |
| S_ | Next state |
| R | Reward |
| Actor | Actor network, which outputs an action according to the state |
| Critic | Critic network, which evaluates the action according to the state |
| Eval_Net | Evaluation network |
| Target_Net | Target network |
| Target_Q | Target Q value |
| TD_Error | TD error |
| Critic_Train | Operation for updating the Critic network |
| Policy_Grads | Policy gradient |
| Actor_Train | Operation for updating the Actor network |
The detailed flow of the DDPG algorithm is described as follows:
Input: the Actor evaluation network with parameters θ; the Actor target network with parameters θ'; the Critic evaluation network with parameters ω; the Critic target network with parameters ω'; the discount factor γ; the soft-update weight coefficient τ; the number m of samples for batch gradient descent; the update period C of the target network parameters; and the maximum number of iterations T.
Output: the optimal parameters θ of the Actor evaluation network and the optimal parameters ω of the Critic evaluation network. The Actor evaluation network is the policy model.
1. Randomly initialize θ and ω, set θ' = θ and ω' = ω, and empty the experience replay set D.
2. Iterate from 1 to T (the total number of training rounds):
① Initialize the initial state s;
② The Actor evaluation network obtains the action a = π_θ(s) + N according to the state s, where N is exploration noise;
③ Execute action a to obtain the new state s', the reward r, and the flag done indicating whether s' is a termination state;
④ Save {s, a, r, s', done} in the experience replay set D;
⑤ Uniformly sample m samples {s_i, a_i, r_i, s'_i, done_i}, i = 1, 2, ..., m, from the experience replay set D; the Actor target network outputs a'_i = π_θ'(s'_i) + N according to s'_i, the Critic evaluation network outputs the current Q value Q(s_i, a_i, ω) according to s_i and a_i, and the Critic target network outputs Q'(s'_i, a'_i, ω') according to s'_i and a'_i; the target Q value y_i is then calculated as
y_i = r_i + γ(1 − done_i) Q'(s'_i, a'_i, ω');
⑥ Use the mean square error loss function
J(ω) = (1/m) Σ_{i=1..m} (y_i − Q(s_i, a_i, ω))²
to update the parameters ω of the Critic evaluation network through gradient back-propagation of the neural network;
⑦ Use
J(θ) = −(1/m) Σ_{i=1..m} Q(s_i, π_θ(s_i), ω)
as the loss function to update the parameters θ of the Actor evaluation network through gradient back-propagation of the neural network;
⑧ If t % C == 1 (i.e. every C steps), update the parameters of the Actor target network and the Critic target network by θ' ← τθ + (1 − τ)θ' and ω' ← τω + (1 − τ)ω';
⑨ If s' is a termination state, end the current round of iteration; otherwise set s = s' and return to step ②.
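The steps above can be summarized by the following minimal training-loop sketch in Python with PyTorch. The environment `env`, total number of rounds `T`, and batch size `m` are assumed to be defined as in the algorithm inputs; `actor`, `critic`, their target copies, and `ddpg_update` are the hypothetical helpers from the earlier sketches, and the noise scale and buffer size are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np
import torch

buffer = deque(maxlen=100_000)                        # experience replay set D
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(T):                              # step 2: iterate 1..T
    s = env.reset()                                   # step 1: initial state
    done = False
    while not done:
        with torch.no_grad():                         # step 2: a = pi_theta(s) + N
            a = actor(torch.as_tensor(s, dtype=torch.float32))
        a = (a + 0.1 * torch.randn_like(a)).numpy()   # exploration noise N
        s_next, r, done = env.step(a)                 # step 3: execute the action
        buffer.append((s, a, r, s_next, float(done))) # step 4: store the transition
        if len(buffer) >= m:                          # steps 5-8: sample and update
            s_b, a_b, r_b, s2_b, d_b = zip(*random.sample(buffer, m))
            to_t = lambda x: torch.as_tensor(np.array(x), dtype=torch.float32)
            batch = (to_t(s_b), to_t(a_b),
                     to_t(r_b).unsqueeze(1), to_t(s2_b), to_t(d_b).unsqueeze(1))
            ddpg_update(actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt, batch)
        s = s_next                                    # step 9: continue from s'
```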
A mechanical arm simulation model is built in a computer virtual environment, the control strategy model of the mechanical arm is obtained through the DDPG training process described above, and the model is deployed on the real mechanical arm, so that the real mechanical arm can control its end effector to reach any given spatial position in real time, laying the foundation for further automation tasks.
Based on the deep reinforcement learning algorithm, the method completes rapid training of the control model of the mechanical arm, and finds an optimized state vector representation and a stable reward function form during training. The trained control model can be effectively deployed on the real mechanical arm: for any given spatial target point, the mechanical arm automatically moves its end to that position, laying a foundation for control applications of the mechanical arm.
The method specifically comprises the following steps:
S1, training the mechanical arm with the deep reinforcement learning algorithm DDPG in a 2D mechanical arm simulation environment without physical attributes, and finding the optimal state vector representation and the optimal reward function form.
S2, training the mechanical arm with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes, where the state vector representation and the reward function form in the DDPG algorithm directly reuse the optimal results obtained in the 2D simulation environment.
S3, directly deploying the control model obtained by training in the 3D mechanical arm simulation environment with physical attributes onto the real mechanical arm; the physical attributes of the real mechanical arm and of the 3D simulated mechanical arm are required to match closely. After deployment, the model can control the end of the real mechanical arm to move within a limited region around any given spatial target point.
In step S1, the deep reinforcement learning algorithm DDPG is used to train the 2D mechanical arm in a 2D mechanical arm simulation environment without physical attributes. A schematic diagram of the 2D simulation environment is shown in fig. 3: the side length of the 2D square plane frame is 400 (the unit is not limited), and the lower-left corner is the coordinate origin (0, 0). The parameters of the 2D mechanical arm include axis a, axis b, end c, and rod length L; axis a is a fixed rotary joint located at the center point of the square with coordinates (200, 200), axis b is a movable rotary joint, and c is the end of the mechanical arm. The included angles between rod ab, rod bc and the horizontal line are ∠θ and ∠α respectively, and the center point of the black block in the lower-left corner represents any given target position point (x, y) on the 2D plane.
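Under the geometry described above, the position of the arm end c follows directly from the two joint angles; the sketch below is a minimal illustration under the assumption that both θ and α are measured from the horizontal and that L = 100 (the invention does not fix these values):

```python
import math

L = 100.0                      # assumed rod length (same units as the 400 x 400 frame)
A = (200.0, 200.0)             # fixed joint a at the center of the square

def end_position(theta, alpha):
    """Position of the arm end c for joint angles theta (rod ab) and alpha (rod bc),
    both measured from the horizontal."""
    b_x = A[0] + L * math.cos(theta)   # movable joint b
    b_y = A[1] + L * math.sin(theta)
    c_x = b_x + L * math.cos(alpha)    # arm end c
    c_y = b_y + L * math.sin(alpha)
    return c_x, c_y

print(end_position(math.radians(30), math.radians(-45)))
```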
Training is carried out based on the deep reinforcement learning algorithm DDPG, and the optimal state vector representation obtained is:
s = [c_x, c_y, |c_x − x|, |c_y − y|]
The physical meaning of each parameter in the state vector is shown in Table 2.
TABLE 2 Optimal state vector representation and its parameters
| State vector parameter | Physical meaning |
| --- | --- |
| c_x | x-axis coordinate of the arm end c |
| c_y | y-axis coordinate of the arm end c |
| \|c_x − x\| | x-axis distance between the arm end c and the target point |
| \|c_y − y\| | y-axis distance between the arm end c and the target point |
Training is carried out based on the deep reinforcement learning algorithm DDPG, and the optimal reward function form obtained is:
r = −√((c_x − x)² + (c_y − y)²)
i.e. the negative of the straight-line distance between the arm end c and the target point. The reward function parameters are described in Table 3.
TABLE 3 Description of the optimal reward function parameters
| Reward function variable | Physical meaning |
| --- | --- |
| c_x | x-axis coordinate of the arm end c |
| c_y | y-axis coordinate of the arm end c |
| x | x-axis coordinate of the target point |
| y | y-axis coordinate of the target point |
In step S2, the mechanical arm is trained with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes; the optimal state vector representation and the optimal reward function form in the DDPG algorithm follow the optimal results obtained in the 2D simulation environment. A schematic diagram of the 3D simulation environment is shown in fig. 4. The physical attributes of the simulated arm include the positional relationship, rotation axis, and maximum angular velocity of each joint, as well as the shape, mass, and collision detection of the arm links; these physical attributes are kept approximately consistent with those of the real mechanical arm. After training with the DDPG algorithm in this environment, the resulting control model takes the optimal state vector as input and outputs the change in each joint angle required for the arm end to reach the target position:
Joint_values ← Joint_values + DDPG(state)
where Joint_values denotes the current angles of the joints of the mechanical arm, and DDPG(state) denotes the joint-angle changes output by the model.
In step S3, the control model is deployed directly onto the real mechanical arm as shown in fig. 5. Since the 3D simulated mechanical arm is generated entirely from the model of the real mechanical arm, their physical properties closely match, so the trained control model can be deployed on the real arm and can control its end to move within a limited region around any given spatial target point. The joints of the real mechanical arm shown in fig. 5 include J1, J2, J3, and J4.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the scope of the claims of the present invention.