Disclosure of Invention
Purpose of the invention: in view of the above problems, the invention provides a reinforcement learning regional signal control method based on vehicle planned paths. In the reinforcement learning algorithm, the locally planned paths of vehicles are used as reference information for the signal control scheme and fed in as state input, so that the overall state and trend of regional traffic can be grasped effectively, the predictability of the control scheme is improved, and the overall traffic operation of the region is optimized.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme. A reinforcement learning regional signal control method based on vehicle planned paths comprises the following steps:
Step 1: design the control framework of the agents for traffic signal control in the target area and model the road traffic state, which comprises: taking each intersection in the target area as an independent agent, and constructing a corresponding reinforcement learning control model and database for each independent agent;
Step 2: let the independent agent at each intersection interact with the environment of that intersection and collect road traffic state information within a certain range of the intersection in real time, the certain range comprising the intersection itself and the entrance lanes of adjacent intersections;
Step 3: take the road traffic state information of the intersection at the current moment as the input of the reinforcement learning control model corresponding to the intersection, and obtain the intersection signal control scheme for the next moment together with an evaluation result of that scheme; the signal control scheme comprises a release phase and a green time;
Step 4: generate an intersection signal control scheme for the next moment with a queue length priority strategy, according to the road traffic state information of the intersection at the current moment;
Step 5: calculate a distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue length priority strategy; if the calculated distance factor is larger than the set distance threshold, implement the intersection signal control scheme generated by the queue length priority strategy at the intersection; otherwise, implement the intersection signal control scheme obtained by the reinforcement learning control model at the intersection;
Step 6: store, in real time, the road traffic state information collected by the intersection agents in the target area, the signal control schemes implemented at the intersections and the rewards obtained from the interaction between the intersection agents and the environment into the databases corresponding to the intersections; when the data stored in an intersection database have accumulated to the set size, update the parameters of the reinforcement learning control model corresponding to that intersection, empty all data in the database after the update is completed, and return to step 2.
Furthermore, the road traffic state information in step 2 is the set formed by a vehicle planned path matrix, a vehicle position matrix, a lane-to-road-section correspondence vector and a green time vector, with which the overall state and trend of regional traffic can be grasped effectively;
the vehicle planned path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the agent's monitoring range are divided into 1-metre cells, and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, the four planned road segment numbers that the vehicle may pass after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
the vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 1; if no vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 0;
the lane-to-road-section correspondence vector is denoted I_{m×1}, where I_i is the number of the road section in which lane i is located;
the green time vector is denoted G_{m×1}, where G_i is the remaining green time of lane i in the current cycle at time t.
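As an illustration only, the following minimal NumPy sketch shows one way these four state components could be assembled at a single time step; the vehicle records and helper names are hypothetical and not part of the patent.

```python
import numpy as np

def build_state(vehicles, lane_section_ids, remaining_green_s, n_cells):
    """Assemble Distribution, Pos, I and G for one time step (illustrative).

    vehicles: list of dicts with 'lane' (row i), 'cell' (1-metre cell index k) and
              'planned_sections' (up to four road-section numbers the vehicle
              will pass after time t).
    lane_section_ids, remaining_green_s: one entry per monitored lane.
    """
    m = len(lane_section_ids)

    distribution = np.zeros((m, n_cells, 4), dtype=np.int32)   # planned-path matrix
    pos = np.zeros((m, n_cells, 1), dtype=np.int8)             # vehicle position matrix

    for v in vehicles:
        i, k = v["lane"], v["cell"]
        pos[i, k, 0] = 1
        for slot, section in enumerate(v["planned_sections"][:4]):
            distribution[i, k, slot] = section                 # planned road-section numbers

    I = np.asarray(lane_section_ids, dtype=np.int32).reshape(m, 1)     # lane -> road section
    G = np.asarray(remaining_green_s, dtype=np.float32).reshape(m, 1)  # remaining green (s)
    return distribution, pos, I, G
```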
Further, the distance factor in step 5 is calculated by the following formula:
where γ is the distance factor; the two schemes compared are the intersection signal control scheme obtained by the reinforcement learning control model and the signal control scheme generated by the queue length priority strategy.
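The displayed formula itself is not reproduced here. Purely as a sketch of the idea, the snippet below assumes the two schemes are encoded as green-time vectors over the phase set and uses their mean absolute difference as the distance factor; the patent's exact formula may differ, and the threshold value σ = 6 is taken from the embodiment.

```python
import numpy as np

def distance_factor(scheme_rl, scheme_queue):
    """Assumed distance factor between two signal control schemes (green-time vectors)."""
    a = np.asarray(scheme_rl, dtype=float)
    b = np.asarray(scheme_queue, dtype=float)
    return float(np.abs(a - b).mean())

# usage: fall back to the queue-length-priority scheme when the RL scheme drifts too far
sigma = 6.0
gamma = distance_factor([15, 0, 0, 0, 0, 0, 0, 0], [12, 0, 0, 0, 0, 0, 0, 0])
use_queue_priority_scheme = gamma > sigma
```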
Further, in step 6, the road traffic state information collected by the intersection agents in the target area, the signal control schemes implemented at the intersections and the rewards obtained from the interaction between the intersection agents and the environment are stored in the form <s_t, a_t, r_{t+1}, s_{t+1}> into the database corresponding to each intersection, where s_t denotes the road traffic state information collected by the intersection agent at time t; a_t denotes the signal control scheme implemented at the intersection at time t; r_{t+1} denotes the reward obtained from the interaction between the intersection agent and the environment at time t+1; and s_{t+1} denotes the road traffic state information collected by the intersection agent at time t+1.
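A minimal sketch of such a per-intersection database is shown below; the class and member names are illustrative and not taken from the patent.

```python
from collections import namedtuple

# one interaction record <s_t, a_t, r_{t+1}, s_{t+1}>
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Per-intersection database of interaction records (a plain list here)."""

    def __init__(self, capacity):
        self.capacity = capacity            # the set size that triggers a model update
        self.data = []

    def store(self, s_t, a_t, r_next, s_next):
        self.data.append(Transition(s_t, a_t, r_next, s_next))

    def is_full(self):
        return len(self.data) >= self.capacity

    def clear(self):
        self.data.clear()
```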
Further, the reward of the interaction between the intersection agent and the environment is calculated from the waiting time of the first vehicle in each intersection entrance lane, the queue length of each intersection entrance lane and the queue length of each intersection exit lane, specifically as follows:
where r_{t+1} is the reward of the interaction between the intersection agent and the environment at time t+1; l_in and l_out are respectively the set of entrance lanes and the set of exit lanes of the intersection; w_i and q_i are respectively the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable that indicates whether the queue length of exit lane j exceeds three quarters of the road section length L_j: if q_j > (3/4)·L_j, then f_j = 1, otherwise f_j = 0; L_j is the road section length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
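The reward equation itself appears as a displayed formula in the original. As a rough illustration only, the sketch below assumes the reward penalises each entrance-lane queue length weighted by its first-vehicle waiting time, plus a penalty δ for every exit lane whose queue exceeds three quarters of its section length; the exact aggregation used by the patent may differ.

```python
def reward(entry_wait_h, entry_queue, exit_queue, exit_section_len, delta):
    """Assumed reward form, not the patent's exact equation.

    entry_wait_h:     first-vehicle waiting time per entrance lane (hours)
    entry_queue:      queue length per entrance lane
    exit_queue:       queue length per exit lane
    exit_section_len: road-section length per exit lane
    delta:            penalty factor for near-saturated exit lanes
    """
    entrance_term = sum(w * q for w, q in zip(entry_wait_h, entry_queue))
    f = [1 if q > 0.75 * L else 0 for q, L in zip(exit_queue, exit_section_len)]
    return -(entrance_term + delta * sum(f))
```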
Further, in step 6, when the data stored in the intersection database have accumulated to the set size, the parameters of the reinforcement learning control model corresponding to the intersection are updated, and all data in the database are emptied after the update is completed; the method comprises:
Step 6.1: initialize the reinforcement learning control model parameters, including:
initialize the values of the hyper-parameters, including the learning rate α, the distance factor threshold σ and the penalty factor δ;
assign initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are respectively the parameters of the action model and of the evaluation model to be updated;
define Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_θ′ model are equal to the parameters of Actor_θ before the update and remain unchanged during the updating process;
set the training times n_actor and n_critic for Actor_θ and Critic_w;
Step 6.2: use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model in the reinforcement learning control model, comprising:
Step 6.21: calculate A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;
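A one-line PyTorch sketch of this advantage computation (the critic interface and tensor shapes are assumptions):

```python
import torch

def advantage(critic, s_t, s_next, r_next, tau):
    """A(s_t, a_t) = r_{t+1} + tau * V_w(s_{t+1}) - V_w(s_t), computed without gradients."""
    with torch.no_grad():
        return r_next + tau * critic(s_next) - critic(s_t)
```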
Step 6.22: calculate the gradient of the Actor_θ model:
where E denotes the mathematical expectation; (s_t, a_t) ~ π_θ′ indicates that the data used are generated by the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′, respectively, implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes taking the derivative with respect to the parameter θ;
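The gradient expression is given as a displayed equation in the original. For reference, the standard PPO2 clipped surrogate objective, whose gradient with respect to θ is the quantity described above, takes the following form; the clipping range ε is an assumed hyper-parameter not named in this section.

```latex
L^{CLIP}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[
  \min\!\left(
    \frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}\,A(s_t,a_t),\;
    \operatorname{clip}\!\left(\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)},\,1-\varepsilon,\,1+\varepsilon\right) A(s_t,a_t)
  \right)
\right]
```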
Step 6.23: update the parameter θ with the Adam optimization method;
Step 6.24: repeat step 6.22 to step 6.23 n_actor times;
Step 6.3: use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model in the reinforcement learning control model, comprising:
Step 6.31: calculate A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) denotes the output of the evaluation model Critic_old_w′ at time t+1;
Step 6.32: calculate the gradient of the evaluation model Critic_w:
where ∇_w denotes taking the derivative with respect to the parameter w;
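The gradient here is likewise given as a displayed equation in the original. A common choice consistent with the advantage of step 6.31 is the squared-error value loss below, whose gradient with respect to w is what step 6.32 computes; this is stated as an assumption, not as the patent's exact expression.

```latex
L(w) = \mathbb{E}\!\left[\left(r_{t+1} + \tau\, V_{w'}(s_{t+1}) - V_w(s_t)\right)^{2}\right]
```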
Step 6.33: update the parameter w with the Adam optimization method;
Step 6.34: repeat step 6.31 to step 6.33 n_critic times;
Step 6.4: empty all data in the database.
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
each intersection and its adjacent intersections in the target area are taken as the monitoring object, and regional vehicle planned path data available in the Internet of Vehicles environment are included in the state variables, so that the road traffic state is characterized more comprehensively; a regional signal control reinforcement learning model is constructed by combining the PPO algorithm and an LSTM network; a distance factor is introduced to measure the distance between the control scheme generated by the reinforcement learning model and that generated by the traditional queue length priority strategy, which effectively prevents an insufficiently trained model from harming road traffic safety and efficiency by producing improper signal control schemes during online learning.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:
(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:
In the Internet of Vehicles environment, the regional planned paths and position information of vehicles are fully utilized, so that the road traffic state can be grasped and analyzed more comprehensively.
Specifically, each intersection is taken as an independent agent, the intersection itself and the entrance lanes of adjacent intersections are taken as its observation range, and the planned path information and position information of the vehicles within this range are collected.
Define the regional vehicle planned path matrix as Distribution_{m×n×4}, where each row corresponds to one lane and the lanes within the agent's monitoring range are divided into 1-metre cells; if a vehicle is present in the k-th cell of lane i at time t, the four road segment numbers planned for the vehicle after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
define the vehicle position matrix as Pos_{m×n×1}, where each row corresponds to a lane within the monitoring range and each lane is divided into 1-metre cells; if a vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 1;
define I_{m×1} as the lane-to-road-section correspondence vector, where I_i is the number of the road section in which lane i is located;
define G_{m×1} as the green time vector, where G_i is the remaining green time of lane i in the current cycle at time t.
The traffic environment state s is the set formed by Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1} and G_{m×1}, with which the overall state and trend of regional traffic can be grasped effectively.
In the present embodiment, referring to FIG. 1, the road traffic state of lane i, i.e., Distribution_{i×n×4}, Pos_{i×n×1}, I_{i×1} and G_{i×1}, is shown in FIG. 2;
(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:
referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.
The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.
To improve training efficiency, the Actor and the Critic share the underlying input layer. Meanwhile, the PPO2 algorithm is combined with a long short-term memory (LSTM) network to enhance the model's memory of historical states, so that the agent can make more reasonable decisions.
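A minimal PyTorch-style sketch of such a shared-bottom LSTM actor-critic is given below; the flattened state dimension, hidden size, output heads and number of phases are placeholders rather than values from the patent, and the traffic state is assumed to be flattened into one vector per time step.

```python
import torch
import torch.nn as nn

class ActorCriticLSTM(nn.Module):
    """Actor and Critic sharing an LSTM input layer (illustrative sizes)."""

    def __init__(self, state_dim, n_phases, hidden=128):
        super().__init__()
        self.shared = nn.LSTM(input_size=state_dim, hidden_size=hidden, batch_first=True)
        self.actor_phase = nn.Linear(hidden, n_phases)  # release-phase logits
        self.actor_green = nn.Linear(hidden, 1)         # green time for that phase
        self.critic = nn.Linear(hidden, 1)              # state value V(s)

    def forward(self, state_seq, hidden_state=None):
        # state_seq: (batch, seq_len, state_dim) flattened traffic states
        out, hidden_state = self.shared(state_seq, hidden_state)
        h = out[:, -1]                                   # last step of the sequence
        phase_probs = torch.softmax(self.actor_phase(h), dim=-1)
        green_time = torch.relu(self.actor_green(h))     # non-negative seconds
        value = self.critic(h)
        return phase_probs, green_time, value, hidden_state
```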
(3) The agent interacts with the road traffic environment, specifically as follows: at time t, the agent reads the road traffic state s_t and, according to the PPO2 control model, outputs the signal scheme for time t, which gives the release phase and green time of the intersection.
The phase set is defined as the combinations of all non-conflicting traffic flows at the intersection; for example, for a typical four-way intersection with an independent entrance lane for each flow direction, the action set is defined as {north-south through, north-south left turn, east-west through, east-west left turn, south through-and-left, north through-and-left, east through-and-left, west through-and-left}, and the duration for which each signal phase is executed is not fixed.
The signal control scheme generated by the agent at time t is not used directly for signal control; the distance factor γ between it and the signal control scheme generated by the queue length priority strategy is calculated first. The queue length priority strategy means that the intersection always gives the green time to the phase with the longest queue; the distance factor γ measures the distance between the signal control scheme output by the PPO2 control model and the signal control scheme generated by the queue length priority strategy. The formula is as follows:
When γ is larger than a set threshold, the intersection signal control scheme generated by the queue length priority strategy is actually implemented at the intersection at time t; otherwise, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection. After the action a_t is implemented, the traffic environment enters the next state s_{t+1}.
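A small sketch of this selection logic (the function name is hypothetical; any scheme-distance function, such as the one sketched earlier, can be passed in):

```python
def select_scheme(scheme_rl, scheme_queue, sigma, distance_factor):
    """Return the scheme actually implemented at time t (illustrative).

    scheme_rl:       scheme output by the PPO2 control model
    scheme_queue:    scheme generated by the queue length priority strategy
    sigma:           distance-factor threshold
    distance_factor: function mapping two schemes to the scalar gamma
    """
    gamma = distance_factor(scheme_rl, scheme_queue)
    return scheme_queue if gamma > sigma else scheme_rl
```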
In the present embodiment, given the signal control scheme generated by the PPO2 control model and the signal control scheme generated by the queue length priority strategy, the distance factor γ is computed; since it does not exceed the threshold σ = 6, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection. After conflict-free processing of this scheme, a_t = [15, 0, 0, 0, 0, 0, 0, 0].
(4) Design the reward r for the interaction between the agent and the environment, with the aims of reducing the queue length at the intersection, reducing vehicle delay and avoiding downstream congestion. The reward is defined from the entrance-lane queue lengths weighted by the first-vehicle waiting times and the exit-lane queue length indicator, i.e., it combines the first-vehicle waiting time w, the entrance-lane queue length q, the penalty factor δ and the exit-lane queue length indicator f:
where l_in and l_out are respectively the set of entrance lanes and the set of exit lanes of the intersection; w_i and q_i are respectively the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable that indicates whether the queue length of exit lane j exceeds three quarters of the road section length L_j: if q_j > (3/4)·L_j, then f_j = 1, otherwise f_j = 0; L_j is the road section length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
In the present embodiment, with reference to figure 3,
the queue length q of each entrance lane of the intersection is [20, 14, 32, 20, 15, 24, 20, 15, 20, 26, 18, 18, 12, 30];
the first-vehicle waiting time w of each entrance lane is [25, 25, 15, 15, 0, 0, 36, 25, 25, 15, 15, 0, 0, 36];
the queue length of each exit lane is [5, 22, 12, 14, 118, 34, 12, 18, 18, 10, 5, 24, 5, 13], which converts to the Boolean variable f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0];
taking δ as 100 and converting w into the unit of hour, the following is obtained:
r ≈ -16.83 - 100 × 1 = -116.83
(5) Store the data of the interaction between the agent and the environment into a database serving as a replay buffer, in the form <s_t, a_t, r_{t+1}, s_{t+1}>. When the data in the database have accumulated to the set size Z, the PPO2 model parameters are updated as follows:
(5.1) initializing reinforcement learning control model parameters, including:
initialize the values of the hyper-parameters: the learning rate α = 0.0001, the distance factor threshold σ = 6, the penalty factor δ = 100, and Z = 512;
assign initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are respectively the parameters of the action model and of the evaluation model to be updated;
define Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_θ′ model are equal to the parameters of Actor_θ before the update and remain unchanged during the updating process;
set the training times n_actor = 10 and n_critic = 10 for Actor_θ and Critic_w;
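Collected as a configuration sketch (the dictionary keys are illustrative):

```python
# hyper-parameters of this embodiment
config = {
    "learning_rate": 1e-4,   # Adam learning rate alpha
    "sigma": 6,              # distance-factor threshold
    "delta": 100,            # penalty factor in the reward
    "buffer_size_Z": 512,    # replay-buffer size that triggers an update
    "n_actor": 10,           # Actor training repetitions per update
    "n_critic": 10,          # Critic training repetitions per update
}
```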
(5.2) Use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model in the PPO2 control model, comprising:
(5.21) calculate A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;
(5.22) calculate the gradient of the Actor_θ model:
where E denotes the mathematical expectation; (s_t, a_t) ~ π_θ′ indicates that the data used are generated by the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′, respectively, implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes taking the derivative with respect to the parameter θ;
(5.23) update the parameter θ with the Adam optimization method;
(5.24) repeat step (5.22) to step (5.23) 10 times (i.e., n_actor times);
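A compact PyTorch sketch of steps (5.21) to (5.24) is given below; it assumes the actor returns a torch.distributions.Categorical over the phase set, and the discount factor tau and clipping range eps are assumed values not specified in this section.

```python
import torch

def update_actor(actor, actor_old, critic, optimizer, batch, tau=0.9, eps=0.2, n_actor=10):
    """Illustrative PPO2 actor update over one replay-buffer batch of <s, a, r, s'>."""
    s, a, r, s_next = batch
    with torch.no_grad():
        adv = r + tau * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)  # A(s_t, a_t)
        logp_old = actor_old(s).log_prob(a)                                 # log P_theta'(a|s)

    for _ in range(n_actor):
        logp = actor(s).log_prob(a)                                         # log P_theta(a|s)
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        loss = -torch.min(ratio * adv, clipped * adv).mean()                # PPO2 clipped loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                                    # Adam step on theta
```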
(5.3) Use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model in the PPO2 control model, comprising:
(5.31) calculate A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) denotes the output of the evaluation model Critic_old_w′ at time t+1;
(5.32) calculate the gradient of the evaluation model Critic_w:
where ∇_w denotes taking the derivative with respect to the parameter w;
(5.33) update the parameter w with the Adam optimization method;
(5.34) repeat step (5.31) to step (5.33) 10 times (i.e., n_critic times);
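A matching sketch of the Critic update in steps (5.31) to (5.34), again with assumed interfaces: V_w(s_t) is regressed toward r_{t+1} + τ·V_{w′}(s_{t+1}) computed with the frozen copy Critic_old.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_old, optimizer, batch, tau=0.9, n_critic=10):
    """Illustrative Critic update over one replay-buffer batch of <s, a, r, s'>."""
    s, _a, r, s_next = batch
    with torch.no_grad():
        target = r + tau * critic_old(s_next).squeeze(-1)   # uses the frozen copy Critic_old
    for _ in range(n_critic):
        value = critic(s).squeeze(-1)
        loss = F.mse_loss(value, target)                     # squared value error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                     # Adam step on w
```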
(5.4) Empty the replay buffer and repeat steps (3) to (5).