Technical Field
The invention belongs to the field of neural network learning control, and in particular relates to a neural network learning control method using eligibility traces.
Background Art
Table-based reinforcement learning methods learn in unknown environments and show excellent adaptability. However, such methods can only solve problems with small state and action spaces. As the scale of a problem grows, the state space tends to grow exponentially, and the "curse of dimensionality" becomes particularly acute. When the tabular method is applied to large-scale problems, the mapping from states to actions in the discrete space must be stored exactly, which often consumes a large amount of memory. If this correspondence is replaced by a continuous function, so that function values take the place of the table, better results can be obtained. Methods for building the mapping from the state space to function values fall into linear parametric fitting methods and nonlinear parametric fitting methods. Because their theoretical analysis is relatively simple, linear parametric fitting methods are often applied to reinforcement learning problems. Nonlinear parametric methods have also been widely used in data fitting, and a typical tool for nonlinear parametric fitting is the neural network. Neural networks have strong adaptability and generalization performance; combining a neural network with reinforcement learning so that the network replaces the table can achieve good results. For table-based reinforcement learning, Sutton proposed the temporal-difference TD(λ) method, which maintains an eligibility trace for each visited state so that every one-step update is also propagated backwards over several steps, greatly accelerating learning. Dayan et al. proved the convergence of the TD(λ) method. Sutton also proposed temporal-difference learning in continuous state spaces and an eligibility-trace method based on the direct gradient method.
Applying BP neural networks (BPNN) to reinforcement learning has been described in many publications at home and abroad, but these methods basically use single-step updates. Introducing eligibility traces into the learning process can greatly improve the training efficiency of the neural network, but it also makes the training process, and in particular the update of the hidden-layer weights, more complicated. Reinforcement learning methods based on a function approximator update its weights during learning; the commonly used methods are the direct gradient method and the residual gradient method. Because the direct gradient method resembles steepest descent in supervised learning, it learns quickly but often converges poorly. The residual gradient method guarantees better convergence, but it converges very slowly. Baird proposed a residual method that both preserves the convergence of the residual gradient method and retains the learning speed of the direct gradient method, achieving good performance. However, Baird only gave the calculation for updating the output-layer weights and did not address the hidden layer.
Summary of the Invention
The purpose of the present invention is to address the low efficiency and slow convergence of existing neural-network-based reinforcement learning. Combining the eligibility-trace method, a multi-step update algorithm for the reinforcement learning process is proposed. The algorithm uses an improved residual method in which, during the training of the neural network, the weight updates of all layers are combined by an optimized linear weighting, so that both the learning speed of the direct gradient method and the convergence of the residual method are obtained; on this basis a neural network learning control method using eligibility traces is provided.
The present invention specifically adopts the following technical solution:
A neural network learning control method using eligibility traces applies a BP neural network to reinforcement learning. The model topology of the BP neural network comprises an input layer, a hidden layer and an output layer. Eligibility traces are used to propagate the local gradient from the output layer to the hidden layer so as to update the hidden-layer weights, and an improved residual gradient method is used to update not only the output-layer weights of the BP neural network but also, in an optimized way, the hidden-layer weights. The method specifically comprises the following steps:
S1. Start the reinforcement learning process based on the BP neural network. The learning Agent interacts with the environment and continuously receives evaluative feedback as rewards, and the reward values are weighted and accumulated. During action selection, the Agent chooses the action that yields the maximum accumulated reward as its optimal action:
The action that the Agent can execute in state s∈S is denoted a∈A. From the action set A it selects the action that maximizes Q^π(s, a) as its optimal action, where Q^π(s, a) is defined as follows:
Q^π(s, a) = E{r_{t+1} + γr_{t+2} + γ²r_{t+3} + … | s_t = s, a_t = a, π}   (1)
where 0 < γ < 1.
When the problem model is unknown, the Q-learning algorithm expresses the update as:
Q(s, a) ← Q(s, a) + α[r + γ max_{a′∈A} Q(s′, a′) − Q(s, a)]   (2)
The Agent updates the Q(s, a) values in each iteration, and the Q(s, a) values converge after many iterations. On the basis of the definition of the Q(s, a) values, the V value is defined as follows:
V(s) = max_{a∈A} Q(s, a)   (3)
In state s, the current optimal policy π* is obtained as:
π*(s) = arg max_{a∈A} Q(s, a)   (4)
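As an illustration of step S1, the following minimal sketch shows ε-greedy action selection over the action-value estimates and the update of equation (2). It assumes a tabular Q store and a generic environment interface; `env.reset`, `env.step` and the `actions` list are placeholder names, not part of the patent.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.2, gamma=0.95, epsilon=0.1):
    """One episode of Q-learning as in equations (1)-(4): the Agent mostly picks the
    action with the largest Q(s, a) and updates Q towards r + gamma * max_a' Q(s', a')."""
    s = env.reset()
    done = False
    while not done:
        if random.random() < epsilon:                    # occasional exploration
            a = random.choice(actions)
        else:                                            # greedy choice, eq. (4)
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        target = r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # update of eq. (2)
        s = s_next
    return Q

# Q = defaultdict(float)   # tabular store; steps S2-S4 replace it with a BP network
```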
S2. A BP neural network is used as the value-function approximator for reinforcement learning. The input of the BP neural network receives the state information; based on the output value V of the output layer of the BP neural network and the reward information r fed back by the environment, the TD algorithm is used to train the BP neural network, and the Agent selects an action a according to the V value function.
The Agent moves from one state X_t to another state X_{t+1} and obtains the reward value r_t. The function value in state X_t is V(X_t), and V(X_t) is represented by the fitting function. For the input state X_t the target output value is r_t + γV(X_{t+1}), and during the update the weights of the corresponding fitting function are updated as:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_t)/∂w   (5)
where the vector X = [x_1, x_2, …, x_i, …, x_m]^T is the state vector.
The number of input-layer nodes is set to m+1, the number of hidden-layer nodes to n+1, and the number of output-layer nodes to 1. The vector Y = [y_1, y_2, …, y_i, …, y_m]^T is the input vector of the BP neural network; the components of the state vector X are assigned one by one to the corresponding components of the input vector Y, y_i ← x_i, with the fixed input y_0 ← 1. The connection weights from the hidden-layer nodes to the output-layer node are:
W_2 = [w_0, w_1, w_2, …, w_n]   (6)
The connection weights from the input layer to the hidden layer are:
W_1 = [w_{ji}],  j = 1, 2, …, n;  i = 0, 1, …, m   (7)
The correction to the synaptic weight connecting neuron node p to neuron node q is:
Δw_{qp} = αδ_q y_p   (8)
where δ_q is the local gradient of neuron q and y_p is the input from node p.
In this three-layer BP neural network there is only one output neuron, and its local gradient is:
δ = φ′(v)(r_t + γV(X_{t+1}) − V(X_t))   (9)
where φ(·) is the activation function of the output node, v is its induced local field, and φ′(v) is the derivative of φ at v.
For neuron j as a hidden-layer node, its local gradient is:
δ_j = φ′_j(v_j) δ w_j   (10)
where w_j is the weight from hidden node j to the output node and i denotes the index of an input-layer node, so that the input-to-hidden correction of equation (8) takes p = i and q = j;
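For concreteness, the following sketch computes V(X) with the three-layer network of equations (6)-(10) and the local gradients used in the weight corrections. It is a non-authoritative illustration; the sigmoid hidden activation, linear output and layer sizes are taken from the embodiment described later, and the class and method names are invented for this example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class ValueNet:
    """Three-layer BP network: m inputs plus a fixed bias input, n hidden units plus a
    hidden bias, and one linear output node (4-16-1 in the experiments)."""
    def __init__(self, m, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.uniform(-0.5, 0.5, size=(n, m + 1))   # eq. (7): input -> hidden
        self.W2 = rng.uniform(-0.5, 0.5, size=(n + 1,))     # eq. (6): hidden -> output

    def forward(self, x):
        y = np.concatenate(([1.0], x))            # fixed input y_0 = 1
        v_hidden = self.W1 @ y                    # induced local fields of hidden nodes
        z = np.concatenate(([1.0], sigmoid(v_hidden)))
        V = float(self.W2 @ z)                    # linear output node: V(X)
        return V, y, z, v_hidden

    def local_gradients(self, td_error, z):
        delta_out = td_error                      # eq. (9) with a linear output, phi'(v) = 1
        # eq. (10): back-propagate through the hidden-to-output weights (bias unit skipped)
        delta_hidden = z[1:] * (1.0 - z[1:]) * delta_out * self.W2[1:]
        return delta_out, delta_hidden
```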
S3. The direct gradient method with eligibility traces is introduced into the calculation. To speed up training, a one-step error update is propagated backwards over several steps; on the BP neural network this appears as an accumulated weight update, and the weight update formula is:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w   (11)
Let e(t) = Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w;
then the eligibility trace of each step is obtained by iteration:
e(t) = ∂V(X_t)/∂w + λ e(t−1),  e(0) = 0   (12)
Multiplying the per-step eligibility trace obtained from equation (12) by the state-transition error of the last step gives the update of the connection synaptic weights of the BP neural network.
The update Δw_j of any connection synapse from the hidden layer to the output layer is:
Δw_j = α(r_t + γV(X_{t+1}) − V(X_t)) e_j(t)   (13)
To obtain the connection synaptic weights from the input layer to the hidden layer, by equation (13) the error value r_t + γV(X_{t+1}) − V(X_t) obtained at time step t is propagated back to time step k with the value:
(r_t + γV(X_{t+1}) − V(X_t)) λ^{t−k}   (14)
At time step k, the local gradient of the output neuron is:
δ(k) = φ′(v(k)) (r_t + γV(X_{t+1}) − V(X_t)) λ^{t−k}   (15)
For neuron j as a hidden-layer node, its local gradient at time step k is:
δ_j(k) = φ′_j(v_j(k)) δ(k) w_j   (16)
For time step k, the correction to the synaptic weight connecting neuron node i to neuron node j is:
Δw_{ji}(k) = αδ_j(k) y_i(k)   (17)
At time step t, after the eligibility trace is introduced, the correction to the synaptic weight connecting neuron node i to neuron node j is:
Δw_{ji} = α Σ_{k=1}^{t} δ_j(k) y_i(k)   (18)
Through the above calculation, the synaptic weights from the hidden layer to the output layer of the BP neural network are adjusted according to the direct gradient method, while the update of the synaptic weights from the input layer to the hidden layer relies on back-propagating the local gradient of the output-layer node to the hidden-layer nodes;
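The accumulation of equations (11)-(18) can be implemented by keeping one trace per weight, as in the following sketch. It is one illustrative reading of step S3, reusing the hypothetical `ValueNet` defined above, not the patent's verbatim implementation.

```python
import numpy as np

def td_lambda_direct_update(net, trace1, trace2, x_t, x_next, r_t,
                            alpha=0.2, gamma=0.95, lam=0.8):
    """Direct-gradient TD(lambda) step: the traces accumulate dV/dw as in eq. (12),
    and both layers are moved by trace * TD-error as in eqs. (13) and (18)."""
    V_t, y, z, _ = net.forward(x_t)
    V_next, _, _, _ = net.forward(x_next)
    td_error = r_t + gamma * V_next - V_t

    # per-step gradients of V(X_t) with respect to the weights
    grad_W2 = z                                          # dV/dW2 for the linear output
    delta_hidden = z[1:] * (1.0 - z[1:]) * net.W2[1:]    # back-propagated through W2
    grad_W1 = np.outer(delta_hidden, y)                  # dV/dW1

    trace2 = lam * trace2 + grad_W2                      # eq. (12), hidden -> output
    trace1 = lam * trace1 + grad_W1                      # eq. (12), input -> hidden

    net.W2 += alpha * td_error * trace2                  # eq. (13)
    net.W1 += alpha * td_error * trace1                  # eq. (18)
    return trace1, trace2, td_error
```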
S4. Using the improved residual method, eligibility traces are introduced into the weight update, and the weight update is extended to the hidden layer of the BP neural network. Using the method of S3, the connection-synaptic-weight update of the three-layer BP neural network is expressed as a ((m+2)n+1)-dimensional vector ΔW_d:
ΔW_d = [Δw_0, Δw_1, …, Δw_n, Δw_{10}, Δw_{20}, …, Δw_{n0}, Δw_{11}, …, Δw_{ji}, …, Δw_{nm}]_d   (19)
The first n+1 entries in equation (19) are the updates of the connection synaptic weights from the hidden layer to the output layer, and the remaining (m+1)n entries are the updates of the connection synaptic weights from the input layer to the hidden layer.
The connection synaptic weights of the BP neural network are also updated with the eligibility-trace-based residual gradient method, and this update of the three-layer BP neural network is expressed as a ((m+2)n+1)-dimensional vector ΔW_rg:
ΔW_rg = [Δw_0, Δw_1, …, Δw_n, Δw_{10}, Δw_{20}, …, Δw_{n0}, Δw_{11}, …, Δw_{ji}, …, Δw_{nm}]_rg   (20)
1) If ΔW_d · ΔW_rg > 0, the angle between the two vectors is acute; the decrease produced by ΔW_d also decreases the residual-gradient update ΔW_rg, so the fitting function converges;
2) If ΔW_d · ΔW_rg < 0, the angle between the two vectors is obtuse; the decrease produced by ΔW_d increases the residual-gradient update ΔW_rg, so the fitting function diverges;
To avoid divergence while keeping the training of the BP neural network fast, a residual update vector ΔW_r is introduced as the weighted average of the vectors ΔW_d and ΔW_rg, defined as:
ΔW_r = (1 − φ)ΔW_d + φΔW_rg   (21)
where φ ∈ [0, 1].
φ should be chosen so that the angle between ΔW_r and ΔW_rg is acute, while keeping ΔW_r as close to ΔW_d as possible. The value φ_⊥ that makes the vector ΔW_r perpendicular to the vector ΔW_rg is obtained from:
ΔW_r · ΔW_rg = 0   (22)
The vector ΔW_r satisfying equation (22) is perpendicular to the vector ΔW_rg.
Solving equation (22) gives the value of φ_⊥:
φ_⊥ = (ΔW_d · ΔW_rg) / (ΔW_d · ΔW_rg − ΔW_rg · ΔW_rg)   (23)
To choose φ, it suffices to add a small positive value μ to φ_⊥ so that ΔW_r leans slightly towards the vector ΔW_rg:
φ = φ_⊥ + μ   (24)
3) If ΔW_d · ΔW_rg = 0, the angle between the two vectors is a right angle, so:
φ_⊥ = 0
and φ is chosen as φ = φ_⊥ + μ = μ   (25)
The above operations guarantee that the weights converge during iteration. Training the weights of every layer of the BP neural network in this way, their update does not cause the function values to diverge; at the same time all layers of the BP neural network are taken into account, so that the weight update vector ΔW_r never drives the weight update vector ΔW_rg obtained by the residual gradient method in the opposite direction, thereby ensuring convergence.
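The weighting of equations (19)-(25) can be sketched as follows. This is a minimal illustration of step S4 under the assumption that `dW_d` and `dW_rg` are the flattened direct-gradient and residual-gradient update vectors (for example produced by routines like the hypothetical ones above).

```python
import numpy as np

def residual_weighted_update(dW_d, dW_rg, mu=0.1):
    """Blend the direct-gradient update dW_d with the residual-gradient update dW_rg
    as in eqs. (21)-(25): phi is taken just past phi_perp, the value at which the
    blended vector would be perpendicular to dW_rg."""
    dot_d_rg = float(np.dot(dW_d, dW_rg))
    dot_rg_rg = float(np.dot(dW_rg, dW_rg))
    denom = dot_d_rg - dot_rg_rg
    if abs(denom) < 1e-12:                       # degenerate case, treat as case 3)
        phi_perp = 0.0
    else:
        phi_perp = dot_d_rg / denom              # eq. (23); equals 0 when the angle is right
    phi = float(np.clip(phi_perp + mu, 0.0, 1.0))    # eqs. (24)-(25), phi kept in [0, 1]
    return (1.0 - phi) * dW_d + phi * dW_rg          # eq. (21)
```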
Preferably, the eligibility-trace-based residual gradient method in S4 is as follows:
A BP neural network is used to fit the value function. The Agent moves from one state X_t to the next state X_{t+1} and obtains the reward value r_t. The function value in state X_t is V(X_t), and V(X_t) is represented by the fitting function. For the input state X_t the target output value is r_t + γV(X_{t+1}), and the error information E is computed as:
E = ½ (r_t + γV(X_{t+1}) − V(X_t))²   (26)
To minimize the error E, the residual gradient method is used to obtain the change Δw of the BP neural network weights in each iteration. Treating both V(X_t) and V(X_{t+1}) as variables, the weights of the fitting function obtained from equation (26) are updated according to the residual gradient method as:
Δw = −α(r_t + γV(X_{t+1}) − V(X_t)) (γ ∂V(X_{t+1})/∂w − ∂V(X_t)/∂w)   (27)
where α is the learning rate; updating the weights of the BP neural network iteratively with equation (27) guarantees that the value function converges.
Rearranging equation (27) gives:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_t)/∂w − αγ(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_{t+1})/∂w   (28)
In equation (28), the term containing ∂V(X_t)/∂w is evaluated exactly as in the direct gradient method of equation (5), and the term containing ∂V(X_{t+1})/∂w is evaluated in essentially the same way as the direct gradient method of equation (5), with the target state as the input.
After the eligibility trace is introduced, the weights of the corresponding fitting function are updated according to the residual gradient method as:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} (∂V(X_k)/∂w − γ ∂V(X_{k+1})/∂w)   (29)
Rearranging equation (29) gives:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w − αγ(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_{k+1})/∂w   (30)
In equation (30), the first term on the right-hand side is evaluated exactly as in the direct gradient method with eligibility traces introduced in step S3, and the second term on the right-hand side is evaluated in the same way as equation (13) of step S3, with the target state as the input.
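One way to read equations (29)-(30) is to keep a second set of traces accumulated on the target-state gradients, as in the sketch below. This is an interpretation offered for illustration only (shown for the output layer and reusing the hypothetical `ValueNet`); the patent text itself only describes how the two terms are evaluated.

```python
def residual_gradient_traces(net, trace_t, trace_next, x_t, x_next, r_t,
                             gamma=0.95, lam=0.8):
    """Accumulate one trace over dV(X_k)/dw and one over dV(X_{k+1})/dw, then form the
    residual-gradient update of eq. (30) as (trace_t - gamma * trace_next) * TD-error."""
    V_t, _, z_t, _ = net.forward(x_t)
    V_next, _, z_next, _ = net.forward(x_next)
    td_error = r_t + gamma * V_next - V_t

    grad_t = z_t                          # dV(X_t)/dW2 for the linear output node
    grad_next = z_next                    # dV(X_{t+1})/dW2, i.e. the target-state input

    trace_t = lam * trace_t + grad_t
    trace_next = lam * trace_next + grad_next
    dW2_rg = td_error * (trace_t - gamma * trace_next)    # eq. (30), output layer only
    return trace_t, trace_next, dW2_rg
```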
The beneficial effects of the invention are as follows: on the basis of a BP neural network, an algorithm is proposed in combination with the eligibility-trace method, realizing multi-step updates of the reinforcement learning process. The problem of back-propagating the local gradient of the output layer to the hidden-layer nodes is solved, so that the hidden-layer weights of the BP neural network can be updated quickly. Through an improved residual method, the weight updates of all layers are combined by an optimized linear weighting during training, which obtains both the learning speed of the direct gradient method and the convergence performance of the residual gradient method; applying this to the update of the hidden-layer weights of the BP neural network improves the convergence of the value function.
Brief Description of the Drawings
Figure 1 is the reinforcement learning model based on the BP neural network;
Figure 2 is the inverted-pendulum balance control model based on reinforcement learning;
Figure 3 is a schematic diagram of the learning-process curves of the simulation experiments;
Figure 4 is a schematic diagram of the cart position versus time in the simulation experiment;
Figure 5 is a schematic diagram of the pole angle versus time in the simulation experiment;
Figure 6 is a schematic diagram of the control force versus time in the simulation experiment.
Detailed Description of the Embodiments
Specific embodiments of the present invention are further described below with reference to the accompanying drawings and specific examples:
As shown in Figure 1, a neural network learning control method using eligibility traces applies a BP neural network (error back-propagation neural network) to reinforcement learning. The model topology of the BP neural network comprises an input layer, a hidden layer and an output layer. Eligibility traces are used to propagate the local gradient from the output layer to the hidden layer so as to update the hidden-layer weights, and an improved residual gradient method is used to update not only the output-layer weights of the BP neural network but also, in an optimized way, the hidden-layer weights. The method specifically comprises the following steps:
S1. Start the reinforcement learning process based on the BP neural network. The learning Agent (a computational system situated in an environment) interacts with the environment and continuously receives evaluative feedback as rewards, and the reward values are weighted and accumulated. During action selection, the Agent chooses the action that yields the maximum accumulated reward as its optimal action:
The action that the Agent can execute in state s∈S is denoted a∈A. From the action set A it selects the action that maximizes Q^π(s, a) as its optimal action, where Q^π(s, a) is defined as follows:
Q^π(s, a) = E{r_{t+1} + γr_{t+2} + γ²r_{t+3} + … | s_t = s, a_t = a, π}   (1)
where 0 < γ < 1.
When the problem model is unknown, the Q-learning algorithm expresses the update as:
Q(s, a) ← Q(s, a) + α[r + γ max_{a′∈A} Q(s′, a′) − Q(s, a)]   (2)
The Agent updates the Q(s, a) values in each iteration, and the Q(s, a) values converge after many iterations. On the basis of the definition of the Q(s, a) values, the V value is defined as follows:
V(s) = max_{a∈A} Q(s, a)   (3)
In state s, the current optimal policy π* is obtained as:
π*(s) = arg max_{a∈A} Q(s, a)   (4)
S2. A BP neural network is used as the value-function approximator for reinforcement learning. The input of the BP neural network receives the state information; based on the output value V of the output layer of the BP neural network and the reward information r fed back by the environment, the TD (temporal-difference) algorithm is used to train the BP neural network, and the Agent selects an action a according to the V value function.
The Agent moves from one state X_t to another state X_{t+1} and obtains the reward value r_t. The function value in state X_t is V(X_t), and V(X_t) is represented by the fitting function. For the input state X_t the target output value is r_t + γV(X_{t+1}), and during the update the weights of the corresponding fitting function are updated as:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_t)/∂w   (5)
where the vector X = [x_1, x_2, …, x_i, …, x_m]^T is the state vector.
The number of input-layer nodes is set to m+1, the number of hidden-layer nodes to n+1, and the number of output-layer nodes to 1. The vector Y = [y_1, y_2, …, y_i, …, y_m]^T is the input vector of the BP neural network; the components of the state vector X are assigned one by one to the corresponding components of the input vector Y, y_i ← x_i, with the fixed input y_0 ← 1. The connection weights from the hidden-layer nodes to the output-layer node are:
W_2 = [w_0, w_1, w_2, …, w_n]   (6)
The connection weights from the input layer to the hidden layer are:
W_1 = [w_{ji}],  j = 1, 2, …, n;  i = 0, 1, …, m   (7)
The correction to the synaptic weight connecting neuron node p to neuron node q is:
Δw_{qp} = αδ_q y_p   (8)
where δ_q is the local gradient of neuron q and y_p is the input from node p.
In this three-layer BP neural network there is only one output neuron, and its local gradient is:
δ = φ′(v)(r_t + γV(X_{t+1}) − V(X_t))   (9)
where φ(·) is the activation function of the output node, v is its induced local field, and φ′(v) is the derivative of φ at v.
For neuron j as a hidden-layer node, its local gradient is:
δ_j = φ′_j(v_j) δ w_j   (10)
where w_j is the weight from hidden node j to the output node and i denotes the index of an input-layer node, so that the input-to-hidden correction of equation (8) takes p = i and q = j;
S3. The direct gradient method with eligibility traces is introduced into the calculation. To speed up training, a one-step error update is propagated backwards over several steps; on the BP neural network this appears as an accumulated weight update, and the weight update formula is:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w   (11)
Let e(t) = Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w;
then the eligibility trace of each step is obtained by iteration:
e(t) = ∂V(X_t)/∂w + λ e(t−1),  e(0) = 0   (12)
Multiplying the per-step eligibility trace obtained from equation (12) by the state-transition error of the last step gives the update of the connection synaptic weights of the BP neural network.
The update Δw_j of any connection synapse from the hidden layer to the output layer is:
Δw_j = α(r_t + γV(X_{t+1}) − V(X_t)) e_j(t)   (13)
To obtain the connection synaptic weights from the input layer to the hidden layer, by equation (13) the error value r_t + γV(X_{t+1}) − V(X_t) obtained at time step t is propagated back to time step k with the value:
(r_t + γV(X_{t+1}) − V(X_t)) λ^{t−k}   (14)
At time step k, the local gradient of the output neuron is:
δ(k) = φ′(v(k)) (r_t + γV(X_{t+1}) − V(X_t)) λ^{t−k}   (15)
For neuron j as a hidden-layer node, its local gradient at time step k is:
δ_j(k) = φ′_j(v_j(k)) δ(k) w_j   (16)
For time step k, the correction to the synaptic weight connecting neuron node i to neuron node j is:
Δw_{ji}(k) = αδ_j(k) y_i(k)   (17)
At time step t, after the eligibility trace is introduced, the correction to the synaptic weight connecting neuron node i to neuron node j is:
Δw_{ji} = α Σ_{k=1}^{t} δ_j(k) y_i(k)   (18)
Through the above calculation, the synaptic weights from the hidden layer to the output layer of the BP neural network are adjusted according to the direct gradient method, while the update of the synaptic weights from the input layer to the hidden layer relies on back-propagating the local gradient of the output-layer node to the hidden-layer nodes;
S4. Using the improved residual method, eligibility traces are introduced into the weight update, and the weight update is extended to the hidden layer of the BP neural network. Using the method of S3, the connection-synaptic-weight update of the three-layer BP neural network is expressed as a ((m+2)n+1)-dimensional vector ΔW_d:
ΔW_d = [Δw_0, Δw_1, …, Δw_n, Δw_{10}, Δw_{20}, …, Δw_{n0}, Δw_{11}, …, Δw_{ji}, …, Δw_{nm}]_d   (19)
The first n+1 entries in equation (19) are the updates of the connection synaptic weights from the hidden layer to the output layer, and the remaining (m+1)n entries are the updates of the connection synaptic weights from the input layer to the hidden layer.
The connection synaptic weights of the BP neural network are also updated with the eligibility-trace-based residual gradient method, and this update of the three-layer BP neural network is expressed as a ((m+2)n+1)-dimensional vector ΔW_rg:
ΔW_rg = [Δw_0, Δw_1, …, Δw_n, Δw_{10}, Δw_{20}, …, Δw_{n0}, Δw_{11}, …, Δw_{ji}, …, Δw_{nm}]_rg   (20)
1) If ΔW_d · ΔW_rg > 0, the angle between the two vectors is acute; the decrease produced by ΔW_d also decreases the residual-gradient update ΔW_rg, so the fitting function converges;
2) If ΔW_d · ΔW_rg < 0, the angle between the two vectors is obtuse; the decrease produced by ΔW_d increases the residual-gradient update ΔW_rg, so the fitting function diverges;
To avoid divergence while keeping the training of the BP neural network fast, a residual update vector ΔW_r is introduced as the weighted average of the vectors ΔW_d and ΔW_rg, defined as:
ΔW_r = (1 − φ)ΔW_d + φΔW_rg   (21)
where φ ∈ [0, 1].
φ should be chosen so that the angle between ΔW_r and ΔW_rg is acute, while keeping ΔW_r as close to ΔW_d as possible. The value φ_⊥ that makes the vector ΔW_r perpendicular to the vector ΔW_rg is obtained from:
ΔW_r · ΔW_rg = 0   (22)
The vector ΔW_r satisfying equation (22) is perpendicular to the vector ΔW_rg.
Solving equation (22) gives the value of φ_⊥:
φ_⊥ = (ΔW_d · ΔW_rg) / (ΔW_d · ΔW_rg − ΔW_rg · ΔW_rg)   (23)
To choose φ, it suffices to add a small positive value μ to φ_⊥ so that ΔW_r leans slightly towards the vector ΔW_rg:
φ = φ_⊥ + μ   (24)
3) If ΔW_d · ΔW_rg = 0, the angle between the two vectors is a right angle, so:
φ_⊥ = 0
and φ is chosen as φ = φ_⊥ + μ = μ   (25)
The above operations guarantee that the weights converge during iteration. Training the weights of every layer of the BP neural network in this way, their update does not cause the function values to diverge; at the same time all layers of the BP neural network are taken into account, so that the weight update vector ΔW_r never drives the weight update vector ΔW_rg obtained by the residual gradient method in the opposite direction, thereby ensuring convergence.
The eligibility-trace-based residual gradient method in S4 is as follows:
A BP neural network is used to fit the value function. The Agent moves from one state X_t to the next state X_{t+1} and obtains the reward value r_t. The function value in state X_t is V(X_t), and V(X_t) is represented by the fitting function. For the input state X_t the target output value is r_t + γV(X_{t+1}), and the error information E is computed as:
E = ½ (r_t + γV(X_{t+1}) − V(X_t))²   (26)
To minimize the error E, the residual gradient method is used to obtain the change Δw of the BP neural network weights in each iteration. Treating both V(X_t) and V(X_{t+1}) as variables, the weights of the fitting function obtained from equation (26) are updated according to the residual gradient method as:
Δw = −α(r_t + γV(X_{t+1}) − V(X_t)) (γ ∂V(X_{t+1})/∂w − ∂V(X_t)/∂w)   (27)
where α is the learning rate; updating the weights of the BP neural network iteratively with equation (27) guarantees that the value function converges. Rearranging equation (27) gives:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_t)/∂w − αγ(r_t + γV(X_{t+1}) − V(X_t)) ∂V(X_{t+1})/∂w   (28)
In equation (28), the term containing ∂V(X_t)/∂w is evaluated exactly as in the direct gradient method of equation (5), and the term containing ∂V(X_{t+1})/∂w is evaluated in essentially the same way as the direct gradient method of equation (5), with the target state as the input.
After the eligibility trace is introduced, the weights of the corresponding fitting function are updated according to the residual gradient method as:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} (∂V(X_k)/∂w − γ ∂V(X_{k+1})/∂w)   (29)
Rearranging equation (29) gives:
Δw = α(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_k)/∂w − αγ(r_t + γV(X_{t+1}) − V(X_t)) Σ_{k=1}^{t} λ^{t−k} ∂V(X_{k+1})/∂w   (30)
In equation (30), the first term on the right-hand side is evaluated exactly as in the direct gradient method with eligibility traces introduced in step S3, and the second term on the right-hand side is evaluated in the same way as equation (13) of step S3, with the target state as the input.
As shown in Figure 2, a cart can move freely on a horizontal track, and a rigid free-swinging pole is mounted on the cart; the pole is in an unstable state. The cart moves left and right under a controllable force F, and the range of the track is [−2.4, 2.4] m. The problem is: the cart moves on the track under the applied force, and the learning system tries to keep the pole upright for a sufficiently long time without letting it fall. If the cart moves beyond the track range [−2.4, 2.4] m, the current episode fails; the episode is also deemed a failure when the angle θ between the pole and the vertical exceeds a certain value. The horizontal displacement x of the cart, its horizontal velocity ẋ, the pole angle θ, and the time derivative of θ, θ̇, are taken as the input values of the BP neural network. When the cart goes beyond the track range [−2.4, 2.4] m or the angle θ goes beyond the range [−12°, 12°], a reward of −1 is received; in all other states the reward is 0.
The motion of the inverted pendulum system is described by the parametric equations of the cart-pole dynamics.
The parameters of these motion equations are set as: gravitational acceleration g = −9.8 m/s², cart mass m_c = 1.0 kg, pole mass m = 0.1 kg, half-length of the pole l = 0.5 m, friction coefficient of the cart on the track μ_c = 0.0005, and friction coefficient between the pole and the cart μ_p = 0.000002. The parametric equations are integrated with the Euler method with a time step of 0.02 seconds, so that the velocity and position of the cart and the angular velocity and angle of the pole are easily obtained.
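Since the parametric equations themselves are not reproduced above, the following sketch uses the standard cart-pole equations of motion (the Barto-Sutton-Anderson formulation) together with the parameter values just listed; it is an assumed reconstruction for illustration, not the patent's own listing.

```python
import math

G, M_CART, M_POLE, L = 9.8, 1.0, 0.1, 0.5   # the embodiment lists g = -9.8; the sign depends on convention
MU_C, MU_P, DT = 0.0005, 0.000002, 0.02     # friction coefficients and Euler time step

def cart_pole_step(x, x_dot, theta, theta_dot, force):
    """One Euler step of the standard cart-pole equations of motion."""
    total_mass = M_CART + M_POLE
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    temp = (force + M_POLE * L * theta_dot ** 2 * sin_t
            - MU_C * math.copysign(1.0, x_dot)) / total_mass
    theta_acc = (G * sin_t - cos_t * temp - MU_P * theta_dot / (M_POLE * L)) / (
        L * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - M_POLE * L * theta_acc * cos_t / total_mass
    x += DT * x_dot                      # Euler integration with a 0.02 s step
    x_dot += DT * x_acc
    theta += DT * theta_dot
    theta_dot += DT * theta_acc
    return x, x_dot, theta, theta_dot

def reward(x, theta):
    """-1 when the cart leaves [-2.4, 2.4] m or |theta| exceeds 12 degrees, otherwise 0."""
    failed = abs(x) > 2.4 or abs(theta) > math.radians(12.0)
    return (-1.0 if failed else 0.0), failed
```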
In the simulation experiment the equations of motion are given by the laws of physics, but the inverted-pendulum learning system does not know these laws in advance; its knowledge structure is built up gradually during continuous learning. In the experiment the parameters are set as: learning rate α = 0.2, discount factor γ = 0.95, eligibility-trace coefficient λ = 0.8, exploration probability ε = 0.1, and improved-residual-method parameter μ = 0.1. The BP neural network has a 4-16-1 structure; the hidden-layer nodes use a sigmoid activation function and the output-layer node uses a linear function.
To verify the effectiveness of the algorithm, the inverted-pendulum control simulation was run 40 times. The weight parameters of the BP neural network are initialized at the start of each run, and each run contains several learning episodes, each of which may succeed or fail. Each episode starts from a valid random position, with the force controlling the balance of the inverted pendulum; if the pendulum can be kept from falling for 10000 steps within one episode, the knowledge it has learned is considered able to control the inverted pendulum successfully. When the current episode fails, or the number of successful steps reaches 10000, a new episode is started.
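A possible outline of this experimental protocol, combining the hypothetical pieces sketched above (`ValueNet`, `td_lambda_direct_update`, `cart_pole_step` and `reward`), might look as follows; the episode and step limits follow the embodiment, while `select_force` is a placeholder for the ε-greedy choice of control force and everything else is illustrative.

```python
import numpy as np

def run_experiment(max_episodes=200, success_steps=10000):
    """One of the 40 runs: fresh network weights, then repeated episodes until the
    pole stays balanced for `success_steps` consecutive steps."""
    net = ValueNet(m=4, n=16)                      # 4-16-1 structure of the embodiment
    for episode in range(1, max_episodes + 1):
        x, x_dot, theta, theta_dot = np.random.uniform(-0.05, 0.05, size=4)
        trace1, trace2 = np.zeros_like(net.W1), np.zeros_like(net.W2)
        for step in range(success_steps):
            state = np.array([x, x_dot, theta, theta_dot])
            force = select_force(net, state)       # placeholder: epsilon-greedy over candidate forces
            x, x_dot, theta, theta_dot = cart_pole_step(x, x_dot, theta, theta_dot, force)
            r, failed = reward(x, theta)
            next_state = np.array([x, x_dot, theta, theta_dot])
            trace1, trace2, _ = td_lambda_direct_update(net, trace1, trace2,
                                                        state, next_state, r)
            if failed:
                break
        else:
            return episode                         # balanced for 10000 steps in this episode
    return None
```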
Table 1 is a statistical table recording, for each of the 40 simulation runs, the number of learning episodes the system needed before it could successfully control the inverted pendulum. In all 40 runs the learning system, using the algorithm of this invention, learned effectively and controlled the inverted pendulum successfully. The maximum number of learning episodes was 18, the minimum was 8, and the average was 12.05.
Table 1
The learning-process curves of the simulation experiments are shown in Figure 3. Taking the 11th run as an example and observing its learning process, it is found that with the method of this invention, after failing in the first 9 episodes, the system successfully controls the inverted pendulum from the 10th episode onward. The numbers of learning steps in the first 10 episodes are 7, 10, 10, 36, 18, 74, 64, 706, 2411 and 10000, respectively.
The results of this method are compared with those of other methods. Barto et al. proposed the AHC method, which takes the four-dimensional state as input and uses two single-layer neural networks as the ASE and the ACE respectively to control the inverted pendulum, with the same parameter settings as in this work; that method discretizes the continuous state, introduces no prior knowledge, and is relatively complex to implement. Based on the AHC method, Anderson et al. proposed and realized control with continuous states. Berenji proposed the GARIC method, which uses fuzzy logic to realize a reinforcement learning system with a generalized-rule intelligent control structure to balance the inverted pendulum. Lin et al. proposed the RFALCON method to solve the inverted-pendulum problem; they embedded fuzzy prior knowledge and performed dynamic parameter learning by adjusting a Critic network and an Action network. Moriarty et al. studied table-based Q-learning for the inverted-pendulum balancing problem and proposed SANE, a symbol-based, adaptive evolutionary neural network algorithm. Jiang Guofei et al. studied inverted-pendulum control with a Q-learning algorithm and a BP neural network and realized model-free control of the inverted pendulum, without using the eligibility-trace technique. Lagoudakis et al. studied the inverted-pendulum problem with the LSPI algorithm, based on basis-function approximation and least-squares policy iteration. Bhatnagar et al. implemented the PG algorithm, using the natural gradient method and the idea of function approximation for temporal-difference learning and training the parameters of the value function online. Martín et al. proposed kNN-TD, a reinforcement learning method based on weighted K nearest neighbours, which fits the current Q value as a weighted combination of the Q values of the K states closest to the current state and thereby generalizes the Q values well; to improve learning efficiency they further proposed the eligibility-trace-based kNN-TD(λ) algorithm. Lee et al. proposed the RFWAC algorithm, built from an incrementally constructed radial-basis network and grounded in receptive-field weighted regression; the receptive fields, whose shape and size can be adapted, are used to build local models. Vien et al. proposed the ACTAMERRL algorithm, which embeds the trainer's early training knowledge before reinforcement learning; its learning framework is easy to implement, and the method applies well to the training of the inverted pendulum. The performance comparison of the various methods is shown in Table 2.
Table 2
To further analyse the performance of the algorithm, Figures 4-6 show, when the system has learned up to the 50th episode, the curves of the cart position, the pole angle and the external control force applied to the cart as functions of time. In Figures 4 and 5 the test duration is 300 seconds, corresponding to 30000 action steps; the curves show that the cart position and the pole angle both stay within the specified ranges, so the algorithm achieves good learning and control performance. Figure 6 shows only a 50-second test window (within 2500 action steps) of the time-force curve of the external control applied to the inverted-pendulum system.
In Table 2, the GARIC method makes full use of prior knowledge for reinforcement learning and improves performance considerably, bringing the number of learning episodes to about 300; the RFALCON method also introduces prior knowledge, bringing the number of learning episodes to about 15. The experimental results of this invention are obtained without embedding prior knowledge and still achieve good learning performance. The above experiment was then repeated with some prior knowledge embedded, the prior knowledge being described as follows:
Forty runs were again performed, and in every run the learning system learned effectively and controlled the inverted pendulum successfully. Table 3 is a statistical table recording, after the above knowledge was embedded, the number of learning episodes each run needed before it could successfully control the inverted pendulum; the maximum number of learning episodes was 14, the minimum was 5, and the average was 7.93. It can be seen that embedding prior knowledge can greatly improve the efficiency of reinforcement learning.
Table 3
Of course, the above description is not a limitation of the present invention, and the present invention is not limited to the above examples. Changes, modifications, additions or substitutions made by those skilled in the art within the essential scope of the present invention shall also fall within the protection scope of the present invention.