Disclosure of Invention
Purpose of the invention: in view of the above problems, the invention provides a reinforcement learning regional signal control method based on vehicle planned paths. In the reinforcement learning algorithm, the locally planned paths of vehicles are used as reference information for the signal control scheme and fed in as state input, so that the overall state and trend of regional traffic can be grasped effectively, the predictability of the control scheme is improved, and the overall traffic operation of the region is optimized.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme. A reinforcement learning regional signal control method based on vehicle planned paths comprises the following steps:
Step 1: design the control framework of the agents for traffic signal control in the target area and model the road traffic state, which comprises: taking each intersection in the target area as an independent agent, and constructing a corresponding reinforcement learning control model and database for each independent agent;
Step 2: let the independent agent at each intersection interact with the environment of that intersection and collect road traffic state information within a certain range of the intersection in real time, the certain range comprising the intersection itself and the entrance lanes of adjacent intersections;
Step 3: take the road traffic state information of the intersection at the current moment as the input of the reinforcement learning control model corresponding to the intersection, and obtain the intersection signal control scheme for the next moment together with an evaluation result of that scheme; the signal control scheme comprises a release phase and a green time;
Step 4: generate an intersection signal control scheme for the next moment with a queue length priority strategy, according to the road traffic state information of the intersection at the current moment;
Step 5: calculate a distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue length priority strategy; if the calculated distance factor is larger than the set distance threshold, implement the intersection signal control scheme generated by the queue length priority strategy at the intersection; otherwise, implement the intersection signal control scheme obtained by the reinforcement learning control model at the intersection;
Step 6: store, in real time, the road traffic state information collected by the intersection agents in the target area, the signal control schemes implemented at the intersections and the rewards obtained from the interaction between the intersection agents and the environment into the databases corresponding to the intersections; when the data stored in an intersection database have accumulated to the set size, update the parameters of the reinforcement learning control model corresponding to that intersection, empty all data in the database after the update is completed, and return to step 2.
Furthermore, the road traffic state information in step 2 is the set formed by a vehicle planned path matrix, a vehicle position matrix, a lane-to-road-section correspondence vector and a green time vector, with which the overall state and trend of regional traffic can be grasped effectively;
the vehicle planned path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the agent's monitoring range are divided into 1-metre cells, and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, the four planned road segment numbers that the vehicle may pass after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
the vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 1; if no vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 0;
the lane-to-road-section correspondence vector is denoted I_{m×1}, where I_i is the number of the road section in which lane i is located;
the green time vector is denoted G_{m×1}, where G_i is the remaining green time of lane i in the current cycle at time t.
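As an illustration only, the following minimal NumPy sketch shows one way these four state components could be assembled at a single time step; the vehicle records and helper names are hypothetical and not part of the patent.

```python
import numpy as np

def build_state(vehicles, lane_section_ids, remaining_green_s, n_cells):
    """Assemble Distribution, Pos, I and G for one time step (illustrative).

    vehicles: list of dicts with 'lane' (row i), 'cell' (1-metre cell index k) and
              'planned_sections' (up to four road-section numbers the vehicle
              will pass after time t).
    lane_section_ids, remaining_green_s: one entry per monitored lane.
    """
    m = len(lane_section_ids)

    distribution = np.zeros((m, n_cells, 4), dtype=np.int32)   # planned-path matrix
    pos = np.zeros((m, n_cells, 1), dtype=np.int8)             # vehicle position matrix

    for v in vehicles:
        i, k = v["lane"], v["cell"]
        pos[i, k, 0] = 1
        for slot, section in enumerate(v["planned_sections"][:4]):
            distribution[i, k, slot] = section                 # planned road-section numbers

    I = np.asarray(lane_section_ids, dtype=np.int32).reshape(m, 1)     # lane -> road section
    G = np.asarray(remaining_green_s, dtype=np.float32).reshape(m, 1)  # remaining green (s)
    return distribution, pos, I, G
```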
Further, the distance factor in step 5 is calculated by the following formula:
where γ is the distance factor; the two schemes compared are the intersection signal control scheme obtained by the reinforcement learning control model and the signal control scheme generated by the queue length priority strategy.
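The displayed formula itself is not reproduced here. Purely as a sketch of the idea, the snippet below assumes the two schemes are encoded as green-time vectors over the phase set and uses their mean absolute difference as the distance factor; the patent's exact formula may differ, and the threshold value σ = 6 is taken from the embodiment.

```python
import numpy as np

def distance_factor(scheme_rl, scheme_queue):
    """Assumed distance factor between two signal control schemes (green-time vectors)."""
    a = np.asarray(scheme_rl, dtype=float)
    b = np.asarray(scheme_queue, dtype=float)
    return float(np.abs(a - b).mean())

# usage: fall back to the queue-length-priority scheme when the RL scheme drifts too far
sigma = 6.0
gamma = distance_factor([15, 0, 0, 0, 0, 0, 0, 0], [12, 0, 0, 0, 0, 0, 0, 0])
use_queue_priority_scheme = gamma > sigma
```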
Further, in step 6, the road traffic state information collected by the intersection agents in the target area, the signal control schemes implemented at the intersections and the rewards obtained from the interaction between the intersection agents and the environment are stored in the form <s_t, a_t, r_{t+1}, s_{t+1}> into the database corresponding to each intersection, where s_t denotes the road traffic state information collected by the intersection agent at time t; a_t denotes the signal control scheme implemented at the intersection at time t; r_{t+1} denotes the reward obtained from the interaction between the intersection agent and the environment at time t+1; and s_{t+1} denotes the road traffic state information collected by the intersection agent at time t+1.
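A minimal sketch of such a per-intersection database is shown below; the class and member names are illustrative and not taken from the patent.

```python
from collections import namedtuple

# one interaction record <s_t, a_t, r_{t+1}, s_{t+1}>
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Per-intersection database of interaction records (a plain list here)."""

    def __init__(self, capacity):
        self.capacity = capacity            # the set size that triggers a model update
        self.data = []

    def store(self, s_t, a_t, r_next, s_next):
        self.data.append(Transition(s_t, a_t, r_next, s_next))

    def is_full(self):
        return len(self.data) >= self.capacity

    def clear(self):
        self.data.clear()
```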
Further, the reward of the interaction between the intersection agent and the environment is calculated from the waiting time of the first vehicle in each intersection entrance lane, the queue length of each intersection entrance lane and the queue length of each intersection exit lane, specifically as follows:
where r_{t+1} is the reward of the interaction between the intersection agent and the environment at time t+1; l_in and l_out are respectively the set of entrance lanes and the set of exit lanes of the intersection; w_i and q_i are respectively the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable that indicates whether the queue length of exit lane j exceeds three quarters of the road section length L_j: if q_j > (3/4)·L_j, then f_j = 1, otherwise f_j = 0; L_j is the road section length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
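The reward equation itself appears as a displayed formula in the original. As a rough illustration only, the sketch below assumes the reward penalises each entrance-lane queue length weighted by its first-vehicle waiting time, plus a penalty δ for every exit lane whose queue exceeds three quarters of its section length; the exact aggregation used by the patent may differ.

```python
def reward(entry_wait_h, entry_queue, exit_queue, exit_section_len, delta):
    """Assumed reward form, not the patent's exact equation.

    entry_wait_h:     first-vehicle waiting time per entrance lane (hours)
    entry_queue:      queue length per entrance lane
    exit_queue:       queue length per exit lane
    exit_section_len: road-section length per exit lane
    delta:            penalty factor for near-saturated exit lanes
    """
    entrance_term = sum(w * q for w, q in zip(entry_wait_h, entry_queue))
    f = [1 if q > 0.75 * L else 0 for q, L in zip(exit_queue, exit_section_len)]
    return -(entrance_term + delta * sum(f))
```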
Further, in step 6, when the data stored in the intersection database have accumulated to the set size, the parameters of the reinforcement learning control model corresponding to the intersection are updated, and all data in the database are emptied after the update is completed; the method comprises:
Step 6.1: initialize the reinforcement learning control model parameters, including:
initialize the values of the hyper-parameters, including the learning rate α, the distance factor threshold σ and the penalty factor δ;
assign initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are respectively the parameters of the action model and of the evaluation model to be updated;
define Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_θ′ model are equal to the parameters of Actor_θ before the update and remain unchanged during the updating process;
set the training times n_actor and n_critic for Actor_θ and Critic_w;
Step 6.2: use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model in the reinforcement learning control model, comprising:
Step 6.21: calculate A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;
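A one-line PyTorch sketch of this advantage computation (the critic interface and tensor shapes are assumptions):

```python
import torch

def advantage(critic, s_t, s_next, r_next, tau):
    """A(s_t, a_t) = r_{t+1} + tau * V_w(s_{t+1}) - V_w(s_t), computed without gradients."""
    with torch.no_grad():
        return r_next + tau * critic(s_next) - critic(s_t)
```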
Step 6.22: calculate the gradient of the Actor_θ model:
where E denotes the mathematical expectation; (s_t, a_t) ~ π_θ′ indicates that the data used are generated by the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′, respectively, implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes taking the derivative with respect to the parameter θ;
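The gradient expression is given as a displayed equation in the original. For reference, the standard PPO2 clipped surrogate objective, whose gradient with respect to θ is the quantity described above, takes the following form; the clipping range ε is an assumed hyper-parameter not named in this section.

```latex
L^{CLIP}(\theta) = \mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[
  \min\!\left(
    \frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}\,A(s_t,a_t),\;
    \operatorname{clip}\!\left(\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)},\,1-\varepsilon,\,1+\varepsilon\right) A(s_t,a_t)
  \right)
\right]
```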
Step 6.23: update the parameter θ with the Adam optimization method;
Step 6.24: repeat step 6.22 to step 6.23 n_actor times;
Step 6.3: use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model in the reinforcement learning control model, comprising:
Step 6.31: calculate A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) denotes the output of the evaluation model Critic_old_w′ at time t+1;
Step 6.32: calculate the gradient of the evaluation model Critic_w:
where ∇_w denotes taking the derivative with respect to the parameter w;
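The gradient here is likewise given as a displayed equation in the original. A common choice consistent with the advantage of step 6.31 is the squared-error value loss below, whose gradient with respect to w is what step 6.32 computes; this is stated as an assumption, not as the patent's exact expression.

```latex
L(w) = \mathbb{E}\!\left[\left(r_{t+1} + \tau\, V_{w'}(s_{t+1}) - V_w(s_t)\right)^{2}\right]
```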
Step 6.33: update the parameter w with the Adam optimization method;
Step 6.34: repeat step 6.31 to step 6.33 n_critic times;
Step 6.4: empty all data in the database.
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
each intersection and its adjacent intersections in the target area are taken as the monitoring object, and regional vehicle planned path data available in the Internet of Vehicles environment are included in the state variables, so that the road traffic state is characterized more comprehensively; a regional signal control reinforcement learning model is constructed by combining the PPO algorithm and an LSTM network; a distance factor is introduced to measure the distance between the control scheme generated by the reinforcement learning model and that generated by the traditional queue length priority strategy, which effectively prevents an insufficiently trained model from harming road traffic safety and efficiency by producing improper signal control schemes during online learning.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:
(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:
In the Internet of Vehicles environment, the regional planned paths and position information of vehicles are fully utilized, so that the road traffic state can be grasped and analyzed more comprehensively.
Specifically, each intersection is taken as an independent agent, the intersection itself and the entrance lanes of adjacent intersections are taken as its observation range, and the planned path information and position information of the vehicles within this range are collected.
Define the regional vehicle planned path matrix as Distribution_{m×n×4}, where each row corresponds to one lane and the lanes within the agent's monitoring range are divided into 1-metre cells; if a vehicle is present in the k-th cell of lane i at time t, the four road segment numbers planned for the vehicle after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3) and Distribution(i,k,4), respectively;
define the vehicle position matrix as Pos_{m×n×1}, where each row corresponds to a lane within the monitoring range and each lane is divided into 1-metre cells; if a vehicle is present in the k-th cell of lane i at time t, Pos(i,k) = 1;
define I_{m×1} as the lane-to-road-section correspondence vector, where I_i is the number of the road section in which lane i is located;
define G_{m×1} as the green time vector, where G_i is the remaining green time of lane i in the current cycle at time t.
The traffic environment state s is the set formed by Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1} and G_{m×1}, with which the overall state and trend of regional traffic can be grasped effectively.
In the present embodiment, referring to FIG. 1, the road traffic state of lane i, i.e., Distribution_{i×n×4}, Pos_{i×n×1}, I_{i×1} and G_{i×1}, is shown in FIG. 2;
(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:
referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.
The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.
To improve training efficiency, the Actor and the Critic share the underlying input layer. Meanwhile, the PPO2 algorithm is combined with a long short-term memory (LSTM) network to enhance the model's memory of historical states, so that the agent can make more reasonable decisions.
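A minimal PyTorch-style sketch of such a shared-bottom LSTM actor-critic is given below; the flattened state dimension, hidden size, output heads and number of phases are placeholders rather than values from the patent, and the traffic state is assumed to be flattened into one vector per time step.

```python
import torch
import torch.nn as nn

class ActorCriticLSTM(nn.Module):
    """Actor and Critic sharing an LSTM input layer (illustrative sizes)."""

    def __init__(self, state_dim, n_phases, hidden=128):
        super().__init__()
        self.shared = nn.LSTM(input_size=state_dim, hidden_size=hidden, batch_first=True)
        self.actor_phase = nn.Linear(hidden, n_phases)  # release-phase logits
        self.actor_green = nn.Linear(hidden, 1)         # green time for that phase
        self.critic = nn.Linear(hidden, 1)              # state value V(s)

    def forward(self, state_seq, hidden_state=None):
        # state_seq: (batch, seq_len, state_dim) flattened traffic states
        out, hidden_state = self.shared(state_seq, hidden_state)
        h = out[:, -1]                                   # last step of the sequence
        phase_probs = torch.softmax(self.actor_phase(h), dim=-1)
        green_time = torch.relu(self.actor_green(h))     # non-negative seconds
        value = self.critic(h)
        return phase_probs, green_time, value, hidden_state
```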
(3) The agent interacts with the road traffic environment, specifically as follows: at time t, the agent reads the road traffic state s_t and, according to the PPO2 control model, outputs the signal scheme for time t, which gives the release phase and green time of the intersection.
The phase set is defined as the combinations of all non-conflicting traffic flows at the intersection; for example, for a typical four-way intersection with an independent entrance lane for each flow direction, the action set is defined as {north-south through, north-south left turn, east-west through, east-west left turn, south through-and-left, north through-and-left, east through-and-left, west through-and-left}, and the duration for which each signal phase is executed is not fixed.
The signal control scheme generated by the agent at time t is not used directly for signal control; the distance factor γ between it and the signal control scheme generated by the queue length priority strategy is calculated first. The queue length priority strategy means that the intersection always gives the green time to the phase with the longest queue; the distance factor γ measures the distance between the signal control scheme output by the PPO2 control model and the signal control scheme generated by the queue length priority strategy. The formula is as follows:
When γ is larger than a set threshold, the intersection signal control scheme generated by the queue length priority strategy is actually implemented at the intersection at time t; otherwise, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection. After the action a_t is implemented, the traffic environment enters the next state s_{t+1}.
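A small sketch of this selection logic (the function name is hypothetical; any scheme-distance function, such as the one sketched earlier, can be passed in):

```python
def select_scheme(scheme_rl, scheme_queue, sigma, distance_factor):
    """Return the scheme actually implemented at time t (illustrative).

    scheme_rl:       scheme output by the PPO2 control model
    scheme_queue:    scheme generated by the queue length priority strategy
    sigma:           distance-factor threshold
    distance_factor: function mapping two schemes to the scalar gamma
    """
    gamma = distance_factor(scheme_rl, scheme_queue)
    return scheme_queue if gamma > sigma else scheme_rl
```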
In the present embodiment, given the signal control scheme generated by the PPO2 control model and the signal control scheme generated by the queue length priority strategy, the distance factor γ is computed; since it does not exceed the threshold σ = 6, the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection. After conflict-free processing of this scheme, a_t = [15, 0, 0, 0, 0, 0, 0, 0].
(4) Design the reward r for the interaction between the agent and the environment, with the aims of reducing the queue length at the intersection, reducing vehicle delay and avoiding downstream congestion. The reward is defined from the entrance-lane queue lengths weighted by the first-vehicle waiting times and the exit-lane queue length indicator, i.e., it combines the first-vehicle waiting time w, the entrance-lane queue length q, the penalty factor δ and the exit-lane queue length indicator f:
where l_in and l_out are respectively the set of entrance lanes and the set of exit lanes of the intersection; w_i and q_i are respectively the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable that indicates whether the queue length of exit lane j exceeds three quarters of the road section length L_j: if q_j > (3/4)·L_j, then f_j = 1, otherwise f_j = 0; L_j is the road section length of lane j; q_j is the queue length of lane j; δ is a penalty factor.
In the present embodiment, with reference to figure 3,
the queue length q of each entrance lane of the intersection is [20, 14, 32, 20, 15, 24, 20, 15, 20, 26, 18, 18, 12, 30];
the first-vehicle waiting time w of each entrance lane is [25, 25, 15, 15, 0, 0, 36, 25, 25, 15, 15, 0, 0, 36];
the queue length of each exit lane is [5, 22, 12, 14, 118, 34, 12, 18, 18, 10, 5, 24, 5, 13], which converts to the Boolean variable f = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0];
taking δ as 100 and converting w into the unit of hour, the following is obtained:
r ≈ -16.83 - 100 × 1 = -116.83
(5) Store the data of the interaction between the agent and the environment into a database serving as a replay buffer, in the form <s_t, a_t, r_{t+1}, s_{t+1}>. When the data in the database have accumulated to the set size Z, the PPO2 model parameters are updated as follows:
(5.1) initializing reinforcement learning control model parameters, including:
initialize the values of the hyper-parameters: the learning rate α = 0.0001, the distance factor threshold σ = 6, the penalty factor δ = 100, and Z = 512;
assign initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are respectively the parameters of the action model and of the evaluation model to be updated;
define Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of the Actor_old_θ′ model are equal to the parameters of Actor_θ before the update and remain unchanged during the updating process;
set the training times n_actor = 10 and n_critic = 10 for Actor_θ and Critic_w;
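Collected as a configuration sketch (the dictionary keys are illustrative):

```python
# hyper-parameters of this embodiment
config = {
    "learning_rate": 1e-4,   # Adam learning rate alpha
    "sigma": 6,              # distance-factor threshold
    "delta": 100,            # penalty factor in the reward
    "buffer_size_Z": 512,    # replay-buffer size that triggers an update
    "n_actor": 10,           # Actor training repetitions per update
    "n_critic": 10,          # Critic training repetitions per update
}
```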
(5.2) Use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the action model in the PPO2 control model, comprising:
(5.21) calculate A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) − V_w(s_t),
where V_w(s_{t+1}) is the evaluation result output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation result output by the evaluation model at time t; τ is a discount factor; A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;
(5.22) calculate the gradient of the Actor_θ model:
where E denotes the mathematical expectation; (s_t, a_t) ~ π_θ′ indicates that the data used are generated by the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′, respectively, implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes taking the derivative with respect to the parameter θ;
(5.23) update the parameter θ with the Adam optimization method;
(5.24) repeat step (5.22) to step (5.23) 10 times (i.e., n_actor times);
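A compact PyTorch sketch of steps (5.21) to (5.24) is given below; it assumes the actor returns a torch.distributions.Categorical over the phase set, and the discount factor tau and clipping range eps are assumed values not specified in this section.

```python
import torch

def update_actor(actor, actor_old, critic, optimizer, batch, tau=0.9, eps=0.2, n_actor=10):
    """Illustrative PPO2 actor update over one replay-buffer batch of <s, a, r, s'>."""
    s, a, r, s_next = batch
    with torch.no_grad():
        adv = r + tau * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)  # A(s_t, a_t)
        logp_old = actor_old(s).log_prob(a)                                 # log P_theta'(a|s)

    for _ in range(n_actor):
        logp = actor(s).log_prob(a)                                         # log P_theta(a|s)
        ratio = torch.exp(logp - logp_old)
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
        loss = -torch.min(ratio * adv, clipped * adv).mean()                # PPO2 clipped loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                                    # Adam step on theta
```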
(5.3) Use all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database to update the evaluation model in the PPO2 control model, comprising:
(5.31) calculate A(s_t, a_t) = r_{t+1} + τ·V_{w′}(s_{t+1}) − V_w(s_t),
where V_{w′}(s_{t+1}) denotes the output of the evaluation model Critic_old_w′ at time t+1;
(5.32) calculate the gradient of the evaluation model Critic_w:
where ∇_w denotes taking the derivative with respect to the parameter w;
(5.33) update the parameter w with the Adam optimization method;
(5.34) repeat step (5.31) to step (5.33) 10 times (i.e., n_critic times);
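A matching sketch of the Critic update in steps (5.31) to (5.34), again with assumed interfaces: V_w(s_t) is regressed toward r_{t+1} + τ·V_{w′}(s_{t+1}) computed with the frozen copy Critic_old.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, critic_old, optimizer, batch, tau=0.9, n_critic=10):
    """Illustrative Critic update over one replay-buffer batch of <s, a, r, s'>."""
    s, _a, r, s_next = batch
    with torch.no_grad():
        target = r + tau * critic_old(s_next).squeeze(-1)   # uses the frozen copy Critic_old
    for _ in range(n_critic):
        value = critic(s).squeeze(-1)
        loss = F.mse_loss(value, target)                     # squared value error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                     # Adam step on w
```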
(5.4) Empty the replay buffer and repeat steps (3) to (5).