Background
Artificial intelligence algorithms are now widely applied to robot control, and robot control algorithms are gradually shifting from equation solving to data-driven methods. This design adopts the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient) to replace the forward (inverse) kinematics solution used in traditional mechanical arm control, obtaining a neural network model directly through data-driven training to control the end of the mechanical arm to reach a target position. With this method, the trained model can be rapidly deployed on the mechanical arm control platform, so that the arm can move quickly to any given target position point. The mechanical arm is trained with the DDPG algorithm in a simulation environment using a two-stage scheme, first 2D modeling and then 3D modeling, which greatly shortens the training time. Finally, the trained algorithm model is implemented and verified on a real mechanical arm, and its control effect meets the application requirements.
A deep reinforcement learning algorithm involves five major elements: Agent, Environment, Action, State, and Reward. As shown in fig. 1, the agent interacts with the environment in real time: after observing a state, the agent outputs an action according to a policy model; the action acts on the environment and changes the state; the environment then gives the agent a reward based on the action and the state; and the agent updates its action-selection policy model according to the state, action, and reward. By repeatedly trying in the environment so as to maximize the reward, the agent learns the mapping from state to action, namely the policy model (or simply the model), which is represented by a parameterized deep neural network.
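As an illustration of this interaction loop, the following minimal Python sketch runs one episode with a hypothetical placeholder environment and policy (these stand-ins are not the invention's actual model):

```python
import numpy as np

class DummyEnv:
    """Hypothetical stand-in for the mechanical-arm environment."""
    def reset(self):
        return np.zeros(4)                         # initial state
    def step(self, action):
        next_state = np.random.randn(4)            # state after the action
        reward = -np.linalg.norm(next_state[:2])   # e.g. negative distance to the target
        done = reward > -0.05                      # target reached
        return next_state, reward, done

def policy(state):
    """Placeholder for the parameterized policy model pi_theta(s)."""
    return np.tanh(np.random.randn(2))             # action in [-1, 1]

env = DummyEnv()
state = env.reset()
for t in range(200):                               # one episode of at most 200 steps
    action = policy(state)                         # agent outputs an action by its policy
    state, reward, done = env.step(action)         # environment returns new state and reward
    if done:                                       # target reached: episode ends
        break
```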
The DDPG algorithm is already widely used in the intelligent control of the mechanical arm, but the following difficulties still exist in implementation:
1. How the data-driven deep reinforcement learning algorithm should acquire training data through interaction between the simulated mechanical arm and the virtual environment, so that an effective control model is obtained.
2. For the training process of the mechanical arm, how to set the state parameters of the mechanical arm and the environment, and how to set the reward function of the training process, so that the control effect of the trained mechanical arm is the best.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a rapid training method for intelligent control of a mechanical arm based on deep reinforcement learning; the training time of the method is short, and the obtained model has a good control effect.
The technical scheme adopted by the invention is as follows: a rapid training method for intelligent control of a mechanical arm based on deep reinforcement learning, comprising the following steps:
S1, training a 2D mechanical arm with the deep reinforcement learning algorithm DDPG in a 2D mechanical arm simulation environment without physical attributes, and finding the optimal state vector representation and the optimal reward function form;
S2, training a 3D mechanical arm with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes, reusing the optimal state vector representation and optimal reward function form found in the 2D simulation environment, so as to obtain a control strategy model;
S3, deploying the control strategy model obtained by training in the 3D mechanical arm simulation environment with physical attributes onto the real mechanical arm.
The 2D mechanical arm comprises an axis a, an axis b, and a tip c; the lengths of rod ab and rod bc are both L; axis a is a fixed rotary joint, axis b is a movable rotary joint, and c is the end of the mechanical arm; the included angle between rod ab and the horizontal line is ∠θ, and the included angle between rod bc and the horizontal line is ∠α.
The optimal state vector obtained in step S1 is:
s = [c_x, c_y, |c_x − x|, |c_y − y|]
where c_x represents the x-axis coordinate of the mechanical arm end c, c_y represents the y-axis coordinate of the mechanical arm end c, |c_x − x| represents the x-axis distance of the mechanical arm end c from the target point, |c_y − y| represents the y-axis distance of the mechanical arm end c from the target point, and (x, y) represents any given target position point on the 2D plane.
The optimal reward function obtained in step S1 is:
r = −√((c_x − x)² + (c_y − y)²)
namely the negative of the straight-line distance between the mechanical arm end c and the target point (x, y).
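For concreteness, a minimal sketch of how this state vector and reward could be computed in Python is shown below; the function and variable names are illustrative, not taken from the invention:

```python
import math

def state_and_reward(c_x, c_y, x, y):
    """Build the 4-dimensional state vector and the reward for the 2D arm.

    (c_x, c_y): current position of the arm end c
    (x, y):     given target point on the 2D plane
    """
    state = [c_x, c_y, abs(c_x - x), abs(c_y - y)]
    reward = -math.hypot(c_x - x, c_y - y)   # negative straight-line distance to the target
    return state, reward

# Example: end at (210, 190), target at (250, 250)
s, r = state_and_reward(210.0, 190.0, 250.0, 250.0)
print(s, r)
```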
the DDPG includes four neural networks, respectively: the target network and the evaluation network of the Actor and the target network and the evaluation network of the Critic are the same in structure, and the target network and the evaluation network of the Critic are the same in structure.
The mean square error loss function
J(ω) = (1/m) Σ_{i=1..m} (y_i − Q(s_i, a_i, ω))²
is used to update the parameters of the Critic's evaluation network by gradient back-propagation of the neural network, where m represents the number of samples of the batch gradient descent, y_i represents the target Q value given by the Critic's target network for the i-th sample, ω represents the parameters of the Critic's evaluation network, s_i represents the state in the i-th sample, and a_i represents the action in the i-th sample.
The loss function
J(θ) = −(1/m) Σ_{i=1..m} Q(s_i, a_i, ω), with a_i = π_θ(s_i),
is used to update the parameters θ of the Actor's evaluation network by gradient back-propagation of the neural network, where m represents the number of samples of the batch gradient descent, ω represents the parameters of the Critic's evaluation network, s_i represents the state in the i-th sample, and a_i represents the action output by the Actor's evaluation network for the i-th sample.
If the iteration count t satisfies t % C == 1 (i.e. every C iterations), the parameters of the Actor target network are updated by θ' ← τθ + (1 − τ)θ', and the parameters of the Critic target network are updated by ω' ← τω + (1 − τ)ω';
where C represents the update period (in steps) of the target network parameters, T denotes the maximum number of iterations, θ denotes the parameters of the Actor's evaluation network, θ' denotes the parameters of the Actor's target network, ω denotes the parameters of the Critic's evaluation network, ω' denotes the parameters of the Critic's target network, ← denotes assigning the result on the right to the variable on the left, and τ denotes the soft-update weight coefficient.
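A minimal PyTorch sketch of these two updates and the soft target-network update is given below; the network objects, optimizer settings, and hyperparameter values are illustrative assumptions, not the invention's actual implementation:

```python
import torch
import torch.nn as nn

def soft_update(target_net, eval_net, tau):
    """theta' <- tau * theta + (1 - tau) * theta' (likewise omega' for the Critic)."""
    with torch.no_grad():
        for p_t, p_e in zip(target_net.parameters(), eval_net.parameters()):
            p_t.copy_(tau * p_e + (1.0 - tau) * p_t)

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.01):
    # batch: tensors (s, a, r, s_next, done); r and done have shape (m, 1)
    s, a, r, s_next, done = batch

    # Critic update: minimize mean-squared error against the target Q value y
    with torch.no_grad():
        a_next = actor_t(s_next)
        y = r + gamma * (1.0 - done) * critic_t(s_next, a_next)
    q = critic(s, a)
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(s, pi_theta(s)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of both target networks
    # (the invention applies these every C iterations; here they run on every call)
    soft_update(actor_t, actor, tau)
    soft_update(critic_t, critic, tau)
```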
The invention has the following beneficial effects: the method trains with a deep reinforcement learning algorithm in a 2D mechanical arm simulation environment without physical attributes, which greatly reduces the training complexity, greatly shortens the training time, and accelerates the training of the control strategy model of the mechanical arm. Meanwhile, the optimal state vector representation and the optimal reward function form are found through training in the 2D simulation environment, so that the convergence speed and stability of the trained control strategy model are optimal, and the end of the mechanical arm can be controlled to quickly reach the target position.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
The deep reinforcement learning algorithm used in the invention is the Deep Deterministic Policy Gradient (DDPG) algorithm. It combines the policy network of the Deterministic Policy Gradient (DPG) algorithm with the Actor-Critic framework, and adopts the experience replay and the separation of a target network from an evaluation network used in Deep Q-Networks (DQN, Deep Q-Network); it achieves good results in environments with continuous action spaces. DDPG contains four neural networks, namely the target network and evaluation network of the Actor and the target network and evaluation network of the Critic; the two Actor networks have exactly the same structure, and the two Critic networks have exactly the same structure.
Fig. 2 is a flowchart of the DDPG algorithm. The Actor evaluation network outputs the current action according to the current state; the Actor target network outputs the next action according to the next state; the Critic evaluation network outputs the current Q value according to the current state and the current action; and the Critic target network outputs the target Q value according to the next state and the next action. The Actor evaluation network updates itself according to the current Q value, while the Critic evaluation network updates itself according to the current Q value, the target Q value, and the reward. At regular intervals, the parameters of the evaluation networks are copied to the corresponding target networks in a weighted-average (soft update) manner.
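As a concrete but hypothetical illustration of these four networks, the sketch below defines simple multilayer-perceptron Actor and Critic networks in PyTorch; the layer sizes and dimensions are assumptions, not the values used by the invention:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action (e.g. joint-angle changes)."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh())
        self.max_action = max_action
    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# Evaluation networks and structurally identical target networks
actor, critic = Actor(4, 2), Critic(4, 2)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
```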
The parameters in FIG. 2 are explained in Table 1:
TABLE 1 Meanings of the parameters in FIG. 2
| Parameter name | Meaning |
| --- | --- |
| S | Current state |
| S_ | Next state |
| R | Reward |
| Actor | Actor network, which outputs an action according to the state |
| Critic | Critic network, which evaluates the action according to the state |
| Eval_Net | Evaluation network |
| Target_Net | Target network |
| Target_Q | Target Q value |
| TD_Error | TD error |
| Critic_Train | Operation for updating the Critic network |
| Policy_Grads | Policy gradient |
| Actor_Train | Operation for updating the Actor network |
The detailed flow of the DDPG algorithm is described as follows:
Input: the Actor evaluation network with parameters θ; the Actor target network with parameters θ'; the Critic evaluation network with parameters ω; the Critic target network with parameters ω'; the discount factor γ; the soft-update weight coefficient τ; the number m of samples for batch gradient descent; the update period C of the target network parameters; and the maximum number of iterations T.
Output: the optimal parameters θ of the Actor evaluation network and the optimal parameters ω of the Critic evaluation network. The Actor evaluation network is the policy model.
1. Randomly initialize θ and ω, set θ' = θ and ω' = ω, and empty the experience replay set D.
2. Iterate from 1 to T (the total number of training rounds):
① Initialize the initial state s;
② The Actor evaluation network obtains the action a = π_θ(s) + N according to the state s, where N is exploration noise;
③ Execute action a to obtain the new state s', the reward r, and the flag done indicating whether s' is a termination state;
④ Save {s, a, r, s', done} in the experience replay set D;
⑤ Uniformly sample m samples {s_i, a_i, r_i, s'_i, done_i}, i = 1, 2, ..., m, from the experience replay set D; the Actor target network outputs a'_i = π_θ'(s'_i) + N according to s'_i, the Critic evaluation network outputs the current Q value Q(s_i, a_i, ω) according to s_i and a_i, and the Critic target network outputs Q'(s'_i, a'_i, ω') according to s'_i and a'_i; the target Q value y_i is then calculated as
y_i = r_i + γ(1 − done_i) Q'(s'_i, a'_i, ω');
⑥ Use the mean square error loss function
J(ω) = (1/m) Σ_{i=1..m} (y_i − Q(s_i, a_i, ω))²
to update the parameters ω of the Critic evaluation network through gradient back-propagation of the neural network;
⑦ Use
J(θ) = −(1/m) Σ_{i=1..m} Q(s_i, π_θ(s_i), ω)
as the loss function to update the parameters θ of the Actor evaluation network through gradient back-propagation of the neural network;
⑧ If t % C == 1 (i.e. every C steps), update the parameters of the Actor target network and the Critic target network by θ' ← τθ + (1 − τ)θ' and ω' ← τω + (1 − τ)ω';
⑨ If s' is a termination state, end the current round of iteration; otherwise set s = s' and return to step ②.
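The steps above can be summarized by the following minimal training-loop sketch in Python with PyTorch. The environment `env`, total number of rounds `T`, and batch size `m` are assumed to be defined as in the algorithm inputs; `actor`, `critic`, their target copies, and `ddpg_update` are the hypothetical helpers from the earlier sketches, and the noise scale and buffer size are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np
import torch

buffer = deque(maxlen=100_000)                        # experience replay set D
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(T):                              # step 2: iterate 1..T
    s = env.reset()                                   # step 1: initial state
    done = False
    while not done:
        with torch.no_grad():                         # step 2: a = pi_theta(s) + N
            a = actor(torch.as_tensor(s, dtype=torch.float32))
        a = (a + 0.1 * torch.randn_like(a)).numpy()   # exploration noise N
        s_next, r, done = env.step(a)                 # step 3: execute the action
        buffer.append((s, a, r, s_next, float(done))) # step 4: store the transition
        if len(buffer) >= m:                          # steps 5-8: sample and update
            s_b, a_b, r_b, s2_b, d_b = zip(*random.sample(buffer, m))
            to_t = lambda x: torch.as_tensor(np.array(x), dtype=torch.float32)
            batch = (to_t(s_b), to_t(a_b),
                     to_t(r_b).unsqueeze(1), to_t(s2_b), to_t(d_b).unsqueeze(1))
            ddpg_update(actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt, batch)
        s = s_next                                    # step 9: continue from s'
```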
A mechanical arm simulation model is built in a computer virtual environment, the control strategy model of the mechanical arm is obtained through the DDPG training process described above, and the model is deployed on the real mechanical arm, so that the real mechanical arm can control its end effector to reach any given spatial position in real time, laying the foundation for further automation tasks.
Based on the deep reinforcement learning algorithm, the method completes rapid training of the control model of the mechanical arm, and finds an optimized state vector representation and a stable reward function form during training. The trained control model can be effectively deployed on the real mechanical arm: for any given spatial target point, the mechanical arm automatically moves its end to that position, laying a foundation for control applications of the mechanical arm.
The method specifically comprises the following steps:
S1, training the mechanical arm with the deep reinforcement learning algorithm DDPG in a 2D mechanical arm simulation environment without physical attributes, and finding the optimal state vector representation and the optimal reward function form.
S2, training the mechanical arm with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes, where the state vector representation and the reward function form in the DDPG algorithm directly reuse the optimal results obtained in the 2D simulation environment.
S3, directly deploying the control model obtained by training in the 3D mechanical arm simulation environment with physical attributes onto the real mechanical arm; the physical attributes of the real mechanical arm and of the 3D simulated mechanical arm are required to match closely. After deployment, the model can control the end of the real mechanical arm to move within a limited region around any given spatial target point.
In step S1, the deep reinforcement learning algorithm DDPG is used to train the 2D mechanical arm in a 2D mechanical arm simulation environment without physical attributes. A schematic diagram of the 2D simulation environment is shown in fig. 3: the side length of the 2D square plane frame is 400 (the unit is not limited), and the lower-left corner is the coordinate origin (0, 0). The parameters of the 2D mechanical arm include axis a, axis b, end c, and rod length L; axis a is a fixed rotary joint located at the center point of the square with coordinates (200, 200), axis b is a movable rotary joint, and c is the end of the mechanical arm. The included angles between rod ab, rod bc and the horizontal line are ∠θ and ∠α respectively, and the center point of the black block in the lower-left corner represents any given target position point (x, y) on the 2D plane.
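Under the geometry described above, the position of the arm end c follows directly from the two joint angles; the sketch below is a minimal illustration under the assumption that both θ and α are measured from the horizontal and that L = 100 (the invention does not fix these values):

```python
import math

L = 100.0                      # assumed rod length (same units as the 400 x 400 frame)
A = (200.0, 200.0)             # fixed joint a at the center of the square

def end_position(theta, alpha):
    """Position of the arm end c for joint angles theta (rod ab) and alpha (rod bc),
    both measured from the horizontal."""
    b_x = A[0] + L * math.cos(theta)   # movable joint b
    b_y = A[1] + L * math.sin(theta)
    c_x = b_x + L * math.cos(alpha)    # arm end c
    c_y = b_y + L * math.sin(alpha)
    return c_x, c_y

print(end_position(math.radians(30), math.radians(-45)))
```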
Training is carried out based on the deep reinforcement learning algorithm DDPG, and the optimal state vector representation obtained is:
s = [c_x, c_y, |c_x − x|, |c_y − y|]
The physical meaning of each parameter in the state vector is shown in Table 2.
TABLE 2 Optimal state vector representation and its parameters
| State vector parameter | Physical meaning |
| --- | --- |
| c_x | x-axis coordinate of the arm end c |
| c_y | y-axis coordinate of the arm end c |
| \|c_x − x\| | x-axis distance between the arm end c and the target point |
| \|c_y − y\| | y-axis distance between the arm end c and the target point |
Training is carried out based on the deep reinforcement learning algorithm DDPG, and the optimal reward function form obtained is:
r = −√((c_x − x)² + (c_y − y)²)
i.e. the negative of the straight-line distance between the arm end c and the target point. The reward function parameters are described in Table 3.
TABLE 3 Description of the optimal reward function parameters
| Reward function variable | Physical meaning |
| --- | --- |
| c_x | x-axis coordinate of the arm end c |
| c_y | y-axis coordinate of the arm end c |
| x | x-axis coordinate of the target point |
| y | y-axis coordinate of the target point |
In step S2, the mechanical arm is trained with the DDPG algorithm in a 3D mechanical arm simulation environment with physical attributes; the optimal state vector representation and the optimal reward function form in the DDPG algorithm follow the optimal results obtained in the 2D simulation environment. A schematic diagram of the 3D simulation environment is shown in fig. 4. The physical attributes of the simulated arm include the positional relationship, rotation axis, and maximum angular velocity of each joint, as well as the shape, mass, and collision detection of the arm links; these physical attributes are kept approximately consistent with those of the real mechanical arm. After training with the DDPG algorithm in this environment, the resulting control model takes the optimal state vector as input and outputs the change in each joint angle required for the arm end to reach the target position:
Joint_values ← Joint_values + DDPG(state)
where Joint_values denotes the current angles of the joints of the mechanical arm, and DDPG(state) denotes the joint-angle changes output by the model.
In step S3, the control model is deployed directly onto the real mechanical arm as shown in fig. 5. Since the 3D simulated mechanical arm is generated entirely from the model of the real mechanical arm, their physical properties closely match, so the trained control model can be deployed on the real arm and can control its end to move within a limited region around any given spatial target point. The joints of the real mechanical arm shown in fig. 5 include J1, J2, J3, and J4.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the scope of the claims of the present invention.