CN113487902A - Reinforcement learning area signal control method based on vehicle planned path - Google Patents

Reinforcement learning area signal control method based on vehicle planned path

Info

Publication number
CN113487902A
Authority
CN
China
Prior art keywords
intersection
signal control
model
reinforcement learning
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110534127.1A
Other languages
Chinese (zh)
Other versions
CN113487902B (en)
Inventor
王昊
卢云雪
董长印
杨朝友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou Fama Intelligent Equipment Co ltd
Southeast University
Original Assignee
Yangzhou Fama Intelligent Equipment Co ltd
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou Fama Intelligent Equipment Co ltd and Southeast University
Priority to CN202110534127.1A
Publication of CN113487902A
Application granted
Publication of CN113487902B
Status: Active
Anticipated expiration

Abstract

Translated from Chinese

The invention discloses a reinforcement learning area signal control method based on vehicle planned paths. Specifically, in an Internet of Vehicles environment, the planned-path information and position information of all vehicles at intersections within an agent's control range are collected, and the reinforcement learning PPO2 algorithm is used to perform distributed signal control of the road intersections in the region, achieving coordinated optimization of regional traffic. In detail: a control framework for multi-agent reinforcement learning in regional traffic signal control is given; the road traffic state is defined from vehicle planned-path information and vehicle position information; the action variables of intersection signal control are defined; the reward for the interaction between the agent and the traffic environment is defined with the goals of reducing intersection queue length, reducing vehicle delay, and avoiding downstream congestion; and a "distance factor" is proposed to measure the distance between the control scheme produced by the PPO2 algorithm and the scheme produced by the queue-length-priority strategy, preventing a poor control scheme output by the PPO2 algorithm from abnormally disturbing road traffic.

Description

Reinforcement learning area signal control method based on vehicle planned path
Technical Field
The invention belongs to the field of traffic management and control, and particularly relates to a reinforcement learning area signal control method based on a vehicle planned path.
Background
In the Internet of Vehicles environment, vehicles exchange information, including their local paths, positions, and speeds, with roadside facilities in real time through on-board equipment. Reinforcement-learning-based signal control methods typically incorporate vehicle position and speed information into the algorithm inputs to develop a more accurate signal control scheme. The local route information of vehicles, however, although easily obtained in the Internet of Vehicles environment and able to reflect the distribution of vehicles and traffic flow over the road network, has not been used in signal control. In addition, when existing multi-agent reinforcement learning algorithms perform distributed control of regional traffic, a single intersection is usually treated as an independent agent, and the feedback the agent receives after generating a signal control scheme usually considers only the vehicle queuing or delay at that intersection; this design is not conducive to the joint control of regional traffic. Moreover, existing reinforcement learning models typically require a preliminary stage of interaction with a traffic simulator, such as SUMO, to accumulate data for model training. The traffic simulation environment differs from the real traffic system, however, and when a reinforcement learning model trained in the simulator is transferred to the actual traffic environment, its control effect is poor.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the above problems, the invention provides a reinforcement learning area signal control method based on the vehicle planned path. In the reinforcement learning algorithm, the local planned path of each vehicle is used, as state input, as reference information for the signal control scheme, so that the overall state and trend of regional traffic can be effectively grasped, the predictability of the control scheme is improved, and the optimization of the overall traffic running state of the region is facilitated.
The technical scheme is as follows: in order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: a reinforcement learning area signal control method based on a vehicle planned path comprises the following steps:
step 1, designing a control framework of an intelligent agent in traffic signal control of a target area, and modeling a road traffic state, wherein the control framework comprises the following steps: taking each intersection in the target area as an independent agent, and constructing a respective corresponding reinforcement learning control model and a database for each independent agent;
step 2: enabling an independent intelligent agent at the intersection to interact with the environment of the intersection, and collecting road traffic state information within a certain range of the intersection in real time; the certain range includes the intersection and an entrance lane of an adjacent intersection;
step 3, taking the road traffic state information of the intersection at the current moment as the input of a reinforcement learning control model corresponding to the intersection to obtain an intersection signal control scheme at the next moment of the current moment and an evaluation result of the control scheme; the signal control scheme comprises a release phase and a green time;
step 4, generating an intersection signal control scheme for the next moment by using the queue-length-priority strategy, according to the road traffic state information of the intersection at the current moment;
step 5, calculating a distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue-length-priority strategy; if the calculated distance factor is larger than the set distance threshold, implementing the intersection signal control scheme generated by the queue-length-priority strategy at the intersection; otherwise, implementing the intersection signal control scheme obtained by the reinforcement learning control model at the intersection;
and step 6, storing the road traffic state information collected by the intersection agents in the target area, the signal control scheme corresponding to each intersection, and the reward for the interaction of each intersection agent with the environment into the database corresponding to each intersection in real time; when the data information stored in an intersection's database has accumulated to the set size, updating the reinforcement learning control model parameters corresponding to that intersection, emptying all data in the database after the update is finished, and returning to step 2. A sketch of this per-intersection control loop is given below.
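To make the flow of steps 2-6 concrete, the following is a minimal sketch of the per-intersection control loop. All function names and interfaces here (observe_state, ppo2_act, and so on) are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the per-intersection control loop (steps 2-6).
# Every interface name below is a hypothetical placeholder.
def control_loop(observe_state, ppo2_act, queue_priority_scheme,
                 distance_factor, apply_scheme, update_model,
                 sigma=6.0, buffer_size=512):
    buffer = []
    while True:
        s_t = observe_state()                      # step 2: state within monitoring range
        a_rl = ppo2_act(s_t)                       # step 3: scheme from the RL model
        a_ql = queue_priority_scheme(s_t)          # step 4: queue-length-priority scheme
        gamma = distance_factor(a_rl, a_ql)        # step 5: compare the two schemes
        a_t = a_ql if gamma > sigma else a_rl      # fall back when they diverge too far
        r_next, s_next = apply_scheme(a_t)         # implement scheme, observe reward
        buffer.append((s_t, a_t, r_next, s_next))  # step 6: store the transition
        if len(buffer) >= buffer_size:             # update once enough data accumulate
            update_model(buffer)
            buffer.clear()
```

The defaults σ = 6 and buffer size Z = 512 follow the embodiment described later.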
Furthermore, the road traffic state information in step 2 is a set formed by a vehicle planned-path matrix, a vehicle position matrix, a lane-to-link correspondence vector, and a green-time vector, which together effectively capture the overall state and trend of regional traffic.
The vehicle planned-path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the agent's monitoring range are divided into cells of 1 meter, and each column corresponds to one cell. If a vehicle is present in the k-th cell of lane i at time t, the numbers of the four planned links the vehicle may pass after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3), and Distribution(i,k,4), respectively.
The vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t, and Pos(i,k) = 0 otherwise.
The lane-to-link correspondence vector is denoted I_{m×1}; I_i is the number of the link on which lane i is located.
The green-time vector is denoted G_{m×1}; G_i is the remaining green time of lane i in the current cycle at time t.
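A minimal sketch of this state encoding is shown below, assuming the vehicle data arrive as (lane, cell, planned links) triples; the function and argument names are illustrative.

```python
import numpy as np

# Hypothetical sketch of the state encoding above: m lanes, n one-metre
# cells per lane, up to four planned link IDs per detected vehicle.
def encode_state(vehicles, m, n, lane_link, green_left):
    """vehicles: iterable of (lane_i, cell_k, planned_links), where
    planned_links lists at most four link numbers after time t."""
    distribution = np.zeros((m, n, 4), dtype=np.int32)
    pos = np.zeros((m, n, 1), dtype=np.int8)
    for lane_i, cell_k, links in vehicles:
        pos[lane_i, cell_k, 0] = 1                  # vehicle present in cell k
        for slot, link in enumerate(links[:4]):     # next four planned links
            distribution[lane_i, cell_k, slot] = link
    I = np.asarray(lane_link).reshape(m, 1)         # lane -> link number
    G = np.asarray(green_left).reshape(m, 1)        # remaining green time per lane
    return distribution, pos, I, G
```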
Further, the distance factor in step 5 is calculated as

γ = d(a_t^RL, a_t^QL)

where γ is the distance factor, a_t^RL is the intersection signal control scheme obtained by the reinforcement learning control model, and a_t^QL is the signal control scheme generated by the queue-length-priority strategy. (The closed-form expression for d appears only as a formula image in the original filing.)
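Since the printed expression is unavailable, the sketch below uses one plausible reading for illustration only: an L1 distance between the two per-phase green-time vectors, combined with the step-5 selection rule.

```python
import numpy as np

# Illustrative distance factor: L1 distance between the two schemes.
# The patent's exact expression is an image in the source; this is an
# assumption, not the published formula.
def distance_factor(a_rl, a_ql):
    return float(np.abs(np.asarray(a_rl) - np.asarray(a_ql)).sum())

def select_scheme(a_rl, a_ql, sigma):
    # Step 5: fall back to the queue-length-priority scheme when the
    # RL scheme strays too far from it.
    return a_ql if distance_factor(a_rl, a_ql) > sigma else a_rl
```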
Further, in step 6, the road traffic state information collected by the intersection agents in the target area, the signal control scheme corresponding to each intersection, and the reward for the interaction of each intersection agent with the environment are stored in the database corresponding to each intersection in the form <s_t, a_t, r_{t+1}, s_{t+1}>, where s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the intersection at time t; r_{t+1} is the reward for the interaction of the intersection agent with the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
Further, the reward for the interaction of the intersection agent with the environment is computed from the waiting time of the first vehicle on each intersection entry lane, the queue length of each entry lane, and the queue length of each exit lane, specifically:

r_{t+1} = -( Σ_{i∈l_in} w_i · q_i + δ · Σ_{j∈l_out} f_j )

where r_{t+1} is the reward for the interaction of the intersection agent with the environment at time t+1; l_in and l_out are the sets of entry lanes and exit lanes of the intersection, respectively; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable measuring whether the queue length of exit lane j exceeds three quarters of the link length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
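As an illustration of this reward (the printed formula is a figure in the source), the sketch below follows the verbal definition: entry-lane queue lengths weighted by the first-vehicle waiting time, plus a penalized exit-lane spillback flag. The negated sum and the exact weighting are assumptions.

```python
import numpy as np

# Sketch of the reward stored in the step-6 transitions.
def reward(w_in_hours, q_in, q_out, L_out, delta=100.0):
    w = np.asarray(w_in_hours)   # first-vehicle waiting time per entry lane (hours)
    q = np.asarray(q_in)         # queue length per entry lane
    # spillback flags: exit-lane queue exceeds 3/4 of its link length
    f = (np.asarray(q_out) > 0.75 * np.asarray(L_out)).astype(float)
    return -(float(w @ q) + delta * float(f.sum()))
```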
Further, in step 6, when the data information stored in the intersection database has accumulated to the set size, the reinforcement learning control model parameters corresponding to the intersection are updated, and all data in the database are cleared after the update is completed, as follows:
Step 6.1, initialize the reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters, including the learning rate α, the distance-factor threshold σ, and the penalty factor δ;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated, respectively;
defining Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of Actor_old_θ′ equal the pre-update parameters of Actor_θ and remain unchanged during the update process;
setting the numbers of training iterations n_actor and n_critic for Actor_θ and Critic_w.
Step 6.2, using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the action model in the reinforcement learning control model:
Step 6.21, compute A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) - V_w(s_t),
where V_w(s_{t+1}) is the evaluation output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation output by the evaluation model at time t; τ is a discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t.
Step 6.22, compute the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_θ′} [ ∇_θ min( (P_θ(a_t|s_t)/P_θ′(a_t|s_t)) · A(s_t,a_t), clip(P_θ(a_t|s_t)/P_θ′(a_t|s_t), 1-ε, 1+ε) · A(s_t,a_t) ) ]

where E denotes the mathematical expectation; (s_t, a_t) ∼ π_θ′ indicates that the data used were obtained with the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′ implement signal control scheme a_t under road traffic state information s_t; ∇_θ denotes the derivative with respect to the parameter θ; and ε is the PPO2 clipping coefficient.
Step 6.23, update the parameter θ with the Adam optimization method.
Step 6.24, repeat steps 6.22-6.23 n_actor times.
Step 6.3, using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the evaluation model in the reinforcement learning control model:
Step 6.31, compute A(s_t, a_t) = r_{t+1} + τ·V_w′(s_{t+1}) - V_w(s_t),
where V_w′(s_{t+1}) is the output of the evaluation model Critic_old_w′.
Step 6.32, compute the gradient of the evaluation model Critic_w:

∇_w J(w) = E[ ∇_w A(s_t, a_t)² ]

where ∇_w denotes the derivative with respect to the parameter w.
Step 6.33, update the parameter w with the Adam optimization method.
Step 6.34, repeat steps 6.31-6.33 n_critic times.
Step 6.4, clear all data information in the database.
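The update in steps 6.1-6.4 corresponds to a standard PPO2 actor-critic loop. The sketch below is one possible PyTorch rendering, under the assumptions that the actor returns a categorical distribution over phases and the critic a scalar value; τ, ε, and the network interfaces are illustrative, not the patent's specification.

```python
import torch

# Hedged PPO2-style update over one replay-buffer batch (steps 6.2-6.4).
def ppo2_update(actor, critic, actor_old, critic_old, batch,
                n_actor=10, n_critic=10, tau=0.9, eps=0.2, lr=1e-4):
    opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_c = torch.optim.Adam(critic.parameters(), lr=lr)
    s, a, r, s_next = batch  # tensors built from <s_t, a_t, r_{t+1}, s_{t+1}>

    for _ in range(n_actor):                     # steps 6.22-6.24
        with torch.no_grad():
            adv = r + tau * critic(s_next).squeeze(-1) - critic(s).squeeze(-1)
            logp_old = actor_old(s).log_prob(a)  # frozen pre-update copy
        ratio = torch.exp(actor(s).log_prob(a) - logp_old)
        loss_a = -torch.min(ratio * adv,
                            torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    for _ in range(n_critic):                    # steps 6.31-6.34
        with torch.no_grad():
            target = r + tau * critic_old(s_next).squeeze(-1)
        loss_c = ((target - critic(s).squeeze(-1)) ** 2).mean()
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
```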
Beneficial effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the intersection and its adjacent intersections in the target area are taken as monitoring objects, and regional vehicle planned-path data under the Internet of Vehicles environment are incorporated into the state variables, so that the road traffic state is represented more comprehensively; a regional signal control reinforcement learning model is constructed by combining the PPO algorithm with an LSTM; and the distance factor is proposed to measure the distance between the scheme generated by the reinforcement learning model and the scheme generated by the traditional queue-length-priority strategy, which effectively prevents a not-yet-mature model from generating an inappropriate signal control scheme during online learning and thereby harming road traffic safety and efficiency.
Drawings
FIG. 1 is a schematic diagram of modeling a lane condition, under an embodiment;
FIG. 2 is a diagram of a road traffic condition result for a lane under one embodiment;
FIG. 3 is a schematic illustration of an intersection in one embodiment;
FIG. 4 is a schematic structural diagram of a PPO2 model according to an embodiment;
FIG. 5 is a logic flow diagram of a method of the present invention in one embodiment.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 5, the method of the present invention is further illustrated by taking an intersection as an example. The invention relates to a reinforcement learning area signal control method based on a vehicle planned path, which specifically comprises the following steps:
(1) designing a control framework of an intelligent agent in regional traffic signal control, modeling a road traffic state, and comprising the following steps:
under the environment of the Internet of vehicles, the regional path and the vehicle position information of the vehicle are fully utilized, so that the road traffic state is more comprehensively grasped and analyzed.
Specifically, each intersection is used as an independent agent, the intersection and the entrance lane of the adjacent intersection are used as observation ranges, and planning path information and vehicle position information of vehicles in the ranges are collected.
The regional vehicle planned-path matrix is defined as Distribution_{m×n×4}; each row corresponds to one lane, and the lanes within the agent's monitoring range are divided into cells of 1 meter. If a vehicle is present in the k-th cell of lane i at time t, the numbers of the four planned links of the vehicle after time t are stored in Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3), and Distribution(i,k,4), respectively.
The vehicle position matrix is defined as Pos_{m×n×1}; each row corresponds to a lane within the monitoring range, with 1-meter cells, and Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t.
I_{m×1} is defined as the lane-to-link correspondence vector, with I_i the number of the link on which lane i is located.
G_{m×1} is defined as the green-time vector, with G_i the remaining green time of lane i in the current cycle at time t.
The traffic environment state s is the set formed by Distribution_{m×n×4}, Pos_{m×n×1}, I_{m×1}, and G_{m×1}, which effectively captures the overall state and trend of regional traffic.
In this embodiment, referring to FIG. 1, the road traffic state results for lane i, namely Distribution_{i×n×4}, Pos_{i×n×1}, I_{i×1}, and G_{i×1}, are as shown in FIG. 2.
(2) constructing a reinforcement learning control model, defining the input and the output of the model, and comprising the following steps:
referring to fig. 4, the reinforcement learning control model adopts a distributed control mode, and an agent needs to give a phase scheme and a timing scheme of an intersection at the same time; each intersection is used as an independent agent, and a reinforcement learning control model is independently trained; taking all the entrances of the intersection and the adjacent intersections as monitoring ranges, each intelligent agent collects the path information and the position information of all vehicles at the intersection in real time, and simultaneously obtains the path information and the position information of all vehicles at the entrances of the adjacent intersections and the remaining green time of each lane from the adjacent intersections as the input of a PPO2 algorithm.
The PPO2 model comprises an action model Actor and an evaluation model Critic, and the action model outputs a signal control scheme a; the evaluation model outputs an evaluation v of the signal control scheme a.
To improve training efficiency, the Actor and the Critic share the underlying input layers. Meanwhile, the PPO2 algorithm is combined with a long short-term memory (LSTM) model to strengthen the model's memory of historical states, so that the agent can make more reasonable decisions.
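A minimal sketch of such a shared-trunk PPO2 network is given below, assuming a flattened state sequence and a categorical phase output; the layer sizes and the choice of nn.LSTM are illustrative assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

# Illustrative actor-critic network with a shared LSTM trunk.
class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_phases, hidden=128):
        super().__init__()
        self.shared = nn.LSTM(state_dim, hidden, batch_first=True)  # shared input layers
        self.actor = nn.Linear(hidden, n_phases)    # phase logits -> control scheme a
        self.critic = nn.Linear(hidden, 1)          # scalar evaluation v

    def forward(self, s_seq):
        h, _ = self.shared(s_seq)                   # encode the state history
        h_t = h[:, -1]                              # last step summarises the memory
        return (torch.distributions.Categorical(logits=self.actor(h_t)),
                self.critic(h_t))
```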
(3) The agent interacts with the road traffic environment. Specifically, at time t the agent reads the road traffic state s_t and, according to the PPO2 control model, outputs a signal scheme a_t^RL for time t, giving the release phase and green time at this intersection.
A phase set is defined as the set of combinations of non-conflicting traffic flows at the intersection. For example, for a typical four-way intersection with an independent entry lane for each flow direction, the action set is defined as {north-south through, north-south left, east-west through, east-west left, south-approach through+left, north-approach through+left, east-approach through+left, west-approach through+left}, and the execution duration of each signal phase is not fixed.
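For illustration, the eight-phase action set can be written as a mapping from phase index to the movements it releases; the movement names below are assumptions matching the example set above, not the patent's exact definition.

```python
# Hypothetical encoding of the eight-phase action set as
# non-conflicting movement groups.
PHASES = {
    0: {"N-S through", "S-N through"},
    1: {"N-S left", "S-N left"},
    2: {"E-W through", "W-E through"},
    3: {"E-W left", "W-E left"},
    4: {"S approach through", "S approach left"},
    5: {"N approach through", "N approach left"},
    6: {"E approach through", "E approach left"},
    7: {"W approach through", "W approach left"},
}
```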
The signal control scheme a_t^RL generated by the agent at time t is not used directly for signal control; rather, the distance factor γ between it and the signal control scheme a_t^QL generated by the queue-length-priority strategy is calculated first. The queue-length-priority strategy means that the intersection always gives green-time priority to the phase with the longest queue; the distance factor γ measures the distance between the signal control scheme a_t^RL output by the PPO2 control model and the signal control scheme a_t^QL generated by the queue-length-priority strategy:

γ = d(a_t^RL, a_t^QL)

When γ is larger than the set threshold, the intersection signal control scheme generated by the queue-length-priority strategy is actually implemented at the intersection at time t, i.e., a_t = a_t^QL; otherwise, the intersection signal control scheme obtained by the PPO2 control model is implemented, i.e., a_t = a_t^RL. After action a_t is implemented, the traffic environment enters the next state s_{t+1}.
In this embodiment, for the signal control scheme a_t^RL generated by the PPO2 control model and the signal control scheme a_t^QL generated by the queue-length-priority strategy (the concrete vectors are given in the original figures), the resulting distance factor is smaller than the threshold σ = 6, so the intersection signal control scheme obtained by the PPO2 control model is implemented at the intersection, i.e., a_t = a_t^RL. Conflict-free processing is then carried out on a_t, giving a_t = [15, 0, 0, 0, 0, 0, 0, 0].
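One plausible reading of this conflict-free processing, sketched below under that assumption, is to keep only the green time of the dominant phase so that exactly one non-conflicting phase group is released:

```python
import numpy as np

# Hypothetical conflict-free processing: zero every entry of the chosen
# scheme except the dominant phase's green time.
def conflict_free(a_t):
    a = np.asarray(a_t, dtype=float)
    out = np.zeros_like(a)
    k = int(a.argmax())
    out[k] = a[k]
    return out          # e.g. -> [15, 0, 0, 0, 0, 0, 0, 0]
```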
(4) Design the reward r for the interaction of the agent with the environment, with the goals of reducing intersection queue length, reducing vehicle delay, and avoiding downstream congestion. The reward is defined as the negative of the entry-lane queue lengths weighted by the first-vehicle waiting time, plus a penalized exit-lane queue index, i.e., reward = first-vehicle waiting time w × entry-lane queue length q + δ × exit-lane queue index f:

r_{t+1} = -( Σ_{i∈l_in} w_i · q_i + δ · Σ_{j∈l_out} f_j )

where l_in and l_out are the sets of entry lanes and exit lanes of the intersection, respectively; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable measuring whether the queue length of exit lane j exceeds three quarters of the link length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
In this embodiment, referring to FIG. 3,
the queue length q of each entry lane of the intersection is [20,14,32,20,15,24,20,15,20,26,18,18,12,30];
the waiting time w of the first vehicle on each entry lane is [25,25,15,15,0,0,36,25,25,15,15,0,0,36];
the queue length p of each exit lane is [5,22,12,14,118,34,12,18,18,10,5,24,5,13], from which the Boolean variable f = [0,0,0,0,1,0,0,0,0,0,0,0] is obtained;
taking δ = 100 and converting w into hours gives:
r ≈ -16.83 - 100 × 1 = -116.83
(5) Store the data of the interaction between the agent and the environment in a database serving as the replay buffer. When the data <s_t, a_t, r_{t+1}, s_{t+1}> accumulated in the database reach the set size Z, the PPO2 model parameters are updated as follows:
(5.1) Initialize the reinforcement learning control model parameters, including:
initializing the values of the hyper-parameters: learning rate α = 0.0001, distance-factor threshold σ = 6, penalty factor δ = 100, and Z = 512;
assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated, respectively;
defining Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models; the parameters of Actor_old_θ′ equal the pre-update parameters of Actor_θ and remain unchanged during the update process;
setting the numbers of training iterations n_actor = 10 and n_critic = 10 for Actor_θ and Critic_w.
(5.2) Using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the action model in the PPO2 control model:
(5.21) Compute A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) - V_w(s_t),
where V_w(s_{t+1}) is the evaluation output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation output by the evaluation model at time t; τ is a discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t.
(5.22) Compute the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_θ′} [ ∇_θ min( (P_θ(a_t|s_t)/P_θ′(a_t|s_t)) · A(s_t,a_t), clip(P_θ(a_t|s_t)/P_θ′(a_t|s_t), 1-ε, 1+ε) · A(s_t,a_t) ) ]

where E denotes the mathematical expectation; (s_t, a_t) ∼ π_θ′ indicates that the data used were obtained with the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′ implement signal control scheme a_t under road traffic state information s_t; and ∇_θ denotes the derivative with respect to the parameter θ.
(5.23) Update the parameter θ with the Adam optimization method.
(5.24) Repeat steps (5.22)-(5.23) 10 times.
(5.3) Using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the evaluation model in the PPO2 control model:
(5.31) Compute A(s_t, a_t) = r_{t+1} + τ·V_w′(s_{t+1}) - V_w(s_t),
where V_w′(s_{t+1}) is the output of the evaluation model Critic_old_w′.
(5.32) Compute the gradient of the evaluation model Critic_w:

∇_w J(w) = E[ ∇_w A(s_t, a_t)² ]

where ∇_w denotes the derivative with respect to the parameter w.
(5.33) Update the parameter w with the Adam optimization method.
(5.34) Repeat steps (5.31)-(5.33) 10 times.
(5.4) Empty the replay buffer and repeat steps (3)-(5).

Claims (6)

Translated from Chinese
1. A reinforcement learning area signal control method based on a vehicle planned path, characterized by comprising the following steps:
Step 1, design the control framework of the agents in the traffic signal control of the target area and model the road traffic state, including: taking each intersection in the target area as an independent agent, and constructing a corresponding reinforcement learning control model and database for each independent agent;
Step 2, make each intersection's independent agent interact with the intersection environment and collect, in real time, road traffic state information within a certain range of the intersection; the certain range includes the intersection and the entry lanes of adjacent intersections;
Step 3, take the road traffic state information of the intersection at the current moment as the input of the reinforcement learning control model corresponding to the intersection, and obtain the intersection signal control scheme for the next moment together with the evaluation result of this control scheme; the signal control scheme includes the release phase and the green time;
Step 4, from the road traffic state information of the intersection at the current moment, generate the intersection signal control scheme for the next moment using the queue-length-priority strategy;
Step 5, calculate the distance factor from the intersection signal control scheme obtained by the reinforcement learning control model and the intersection signal control scheme generated by the queue-length-priority strategy; if the calculated distance factor is greater than the set distance threshold, implement at the intersection the scheme generated by the queue-length-priority strategy; otherwise, implement the scheme obtained by the reinforcement learning control model;
Step 6, store the road traffic state information collected by the intersection agents in the target area, the signal control scheme corresponding to each intersection, and the reward for the interaction of each intersection agent with the environment in real time in the database corresponding to each intersection; when the data information stored in an intersection's database has accumulated to the set size, update the reinforcement learning control model parameters corresponding to that intersection, clear all data in the database after the update is completed, and return to Step 2.
2. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, characterized in that the road traffic state information in step 2 is a set formed by a vehicle planned-path matrix, a vehicle position matrix, a lane-to-link correspondence vector, and a green-time vector;
the vehicle planned-path matrix is denoted Distribution_{m×n×4}, where each row corresponds to one lane; the lanes within the agent's monitoring range are divided into cells of 1 meter, and each column corresponds to one cell; if a vehicle is present in the k-th cell of lane i at time t, Distribution(i,k,1), Distribution(i,k,2), Distribution(i,k,3), and Distribution(i,k,4) store the numbers of the four planned links the vehicle may pass after time t, respectively;
the vehicle position matrix is denoted Pos_{m×n×1}, where each row corresponds to a lane within the agent's monitoring range and each column corresponds to one cell; Pos(i,k) = 1 if a vehicle is present in the k-th cell of lane i at time t, and Pos(i,k) = 0 otherwise;
the lane-to-link correspondence vector is denoted I_{m×1}, with I_i the number of the link on which lane i is located;
the green-time vector is denoted G_{m×1}, with G_i the remaining green time of lane i in the current cycle at time t.
3. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, characterized in that the distance factor in step 5 is calculated as

γ = d(a_t^RL, a_t^QL)

where γ is the distance factor, a_t^RL is the intersection signal control scheme obtained by the reinforcement learning control model, and a_t^QL is the signal control scheme generated by the queue-length-priority strategy.
4. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, characterized in that in step 6 the road traffic state information collected by the intersection agents in the target area, the signal control scheme corresponding to each intersection, and the reward for the interaction of each intersection agent with the environment are stored in the database corresponding to each intersection in the form <s_t, a_t, r_{t+1}, s_{t+1}>, where s_t is the road traffic state information collected by the intersection agent at time t; a_t is the signal control scheme implemented at the target intersection at time t; r_{t+1} is the reward for the interaction of the intersection agent with the environment at time t+1; and s_{t+1} is the road traffic state information collected by the intersection agent at time t+1.
5. The reinforcement learning area signal control method based on a vehicle planned path according to claim 1, characterized in that the reward for the interaction of the intersection agent with the environment is calculated from the first-vehicle waiting time on the intersection entry lanes, the queue lengths of the entry lanes, and the queue lengths of the exit lanes, as follows:

r_{t+1} = -( Σ_{i∈l_in} w_i · q_i + δ · Σ_{j∈l_out} f_j )

where r_{t+1} is the reward for the interaction of the intersection agent with the environment at time t+1; l_in and l_out are the sets of entry lanes and exit lanes of the intersection, respectively; w_i and q_i are the first-vehicle waiting time and the queue length of lane i; f_j is a Boolean variable measuring whether the exit-lane queue length exceeds three quarters of the link length L_j: f_j = 1 if q_j > (3/4)·L_j, otherwise f_j = 0; L_j is the link length of lane j; q_j is the queue length of lane j; and δ is a penalty factor.
6. The reinforcement learning area signal control method based on a vehicle planned path according to claim 4, characterized in that in step 6, when the data information stored in the intersection database has accumulated to a certain size, the reinforcement learning control model parameters corresponding to the intersection are updated and all data in the database are cleared after the update is completed, including:
Step 6.1, initialize the reinforcement learning control model parameters, including: initializing the values of the hyper-parameters, including the learning rate α, the distance-factor threshold σ, and the penalty factor δ; assigning initial values to the parameters of the action model Actor_θ and the evaluation model Critic_w, where θ and w are the parameters of the action model and the evaluation model to be updated; defining Actor_old_θ′ and Critic_old_w′ as copies of the Actor_θ and Critic_w models, i.e., the parameters of Actor_old_θ′ equal the pre-update parameters of Actor_θ and remain unchanged during the update process; and setting the numbers of training iterations n_actor and n_critic for Actor_θ and Critic_w;
Step 6.2, using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the action model in the reinforcement learning control model:
Step 6.21, compute A(s_t, a_t) = r_{t+1} + τ·V_w(s_{t+1}) - V_w(s_t), where V_w(s_{t+1}) is the evaluation output by the evaluation model Critic_w at time t+1; V_w(s_t) is the evaluation output by the evaluation model at time t; τ is a discount factor; and A(s_t, a_t) is the advantage of implementing signal control scheme a_t under road traffic state information s_t;
Step 6.22, compute the gradient of the Actor_θ model:

∇_θ J(θ) = E_{(s_t,a_t)∼π_θ′} [ ∇_θ min( (P_θ(a_t|s_t)/P_θ′(a_t|s_t)) · A(s_t,a_t), clip(P_θ(a_t|s_t)/P_θ′(a_t|s_t), 1-ε, 1+ε) · A(s_t,a_t) ) ]

where E denotes the mathematical expectation; (s_t, a_t) ∼ π_θ′ indicates that the data used were obtained with the Actor_old_θ′ model; P_θ(a_t|s_t) and P_θ′(a_t|s_t) are the probabilities that the action models Actor_θ and Actor_old_θ′ implement signal control scheme a_t under road traffic state information s_t; and ∇_θ denotes the derivative with respect to the parameter θ;
Step 6.23, update the parameter θ with the Adam optimization method;
Step 6.24, repeat steps 6.22-6.23 n_actor times;
Step 6.3, using all data x_t = <s_t, a_t, r_{t+1}, s_{t+1}> in the database, update the evaluation model in the reinforcement learning control model:
Step 6.31, compute A(s_t, a_t) = r_{t+1} + τ·V_w′(s_{t+1}) - V_w(s_t), where V_w′(s_{t+1}) is the output of the evaluation model Critic_old_w′;
Step 6.32, compute the gradient of the evaluation model Critic_w:

∇_w J(w) = E[ ∇_w A(s_t, a_t)² ]

where ∇_w denotes the derivative with respect to the parameter w;
Step 6.33, update the parameter w with the Adam optimization method;
Step 6.34, repeat steps 6.31-6.33 n_critic times;
Step 6.4, clear all data information in the database.
CN202110534127.1A, priority date 2021-05-17, filing date 2021-05-17: A Reinforcement Learning Area Signal Control Method Based on Vehicle Planning Path. Active; granted as CN113487902B (en).

Priority Applications (1)

CN202110534127.1A, priority date 2021-05-17, filing date 2021-05-17: A Reinforcement Learning Area Signal Control Method Based on Vehicle Planning Path

Applications Claiming Priority (1)

CN202110534127.1A, priority date 2021-05-17, filing date 2021-05-17: A Reinforcement Learning Area Signal Control Method Based on Vehicle Planning Path

Publications (2)

CN113487902A, published 2021-10-08
CN113487902B (en), published 2022-08-12

Family ID: 77933576

Family Applications (1)

CN202110534127.1A (Active, granted as CN113487902B (en)), priority date 2021-05-17, filing date 2021-05-17: A Reinforcement Learning Area Signal Control Method Based on Vehicle Planning Path

Country Status (1)

CN: CN113487902B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
RU2379761C1 (en)*, priority 2008-07-01, published 2010-01-20, Государственное образовательное учреждение высшего профессионального образования "Уральский государственный технический университет УПИ имени первого Президента России Б.Н.Ельцина": Method of controlling road traffic at intersection
CN105046987A (en)*, priority 2015-06-17, published 2015-11-11, 苏州大学: Road traffic signal lamp coordination control method based on reinforcement learning
CN112365724A (en)*, priority 2020-04-13, published 2021-02-12, 北方工业大学: Continuous intersection signal cooperative control method based on deep reinforcement learning
CN111915894A (en)*, priority 2020-08-06, published 2020-11-10, 北京航空航天大学: Variable lane and traffic signal cooperative control method based on deep reinforcement learning
CN112632858A (en)*, priority 2020-12-23, published 2021-04-09, 浙江工业大学: Traffic light signal control method based on Actor-Critic framework deep reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
YUNXUE LU, HAO WANG: "Compatibility-Based Approach for Routing and Scheduling the Demand Responsive Connector", IEEE Access *
黄艳国 et al.: "基于多智能体的城市交通区域协调控制方法" (A multi-agent-based coordinated control method for urban traffic regions), 《武汉理工大学学报(交通科学与工程版)》 (Journal of Wuhan University of Technology, Transportation Science & Engineering) *

Cited By (12)

* Cited by examiner, † Cited by third party
CN114550470A (en)*, priority 2022-03-03, published 2022-05-27, 沈阳化工大学: A wireless network interconnected intelligent traffic signal light
CN114819617A (en)*, priority 2022-03-03, published 2022-07-29, 北京邮电大学: Bus scheduling method based on reinforcement learning
CN114550470B (en)*, priority 2022-03-03, published 2023-08-22, 沈阳化工大学: Wireless network interconnection intelligent traffic signal lamp
CN114819617B (en)*, priority 2022-03-03, published 2025-03-11, 北京邮电大学: Bus scheduling method based on reinforcement learning
CN114667852A (en)*, priority 2022-03-14, published 2022-06-28, 广西大学: An intelligent collaborative control method for hedge trimming robots based on deep reinforcement learning
CN114667852B (en)*, priority 2022-03-14, published 2023-04-14, 广西大学: An intelligent collaborative control method for hedge trimming robots based on deep reinforcement learning
CN116092297A (en)*, priority 2023-04-07, published 2023-05-09, 南京航空航天大学: Edge calculation method and system for low-permeability distributed differential signal control
CN116524741A (en)*, priority 2023-04-24, published 2023-08-01, 上海交通大学: Special vehicle priority passing method and system based on reinforcement learning
CN116524741B (en)*, priority 2023-04-24, published 2025-08-05, 上海交通大学: Special vehicle priority passage method and system based on reinforcement learning
CN119207133A (en)*, priority 2024-09-25, published 2024-12-27, 同济大学: Decentralized adaptive signal control method considering the impact of downstream traffic pressure transmission
CN119207133B (en)*, priority 2024-09-25, published 2025-09-30, 同济大学: Decentralized adaptive signal control method considering the impact of downstream traffic pressure transmission
CN120293170A (en)*, priority 2025-04-11, published 2025-07-11, 华芯数智(北京)科技有限公司: Automatic driving path planning method, system, electronic device and medium

Also Published As

CN113487902B (en), published 2022-08-12

Similar Documents

CN113487902B (en): A Reinforcement Learning Area Signal Control Method Based on Vehicle Planning Path
CN111696370B (en): Traffic light control method based on heuristic deep Q network
Lin et al.: Traffic signal optimization based on fuzzy control and differential evolution algorithm
CN108510764B (en): Multi-intersection self-adaptive phase difference coordination control system and method based on Q learning
CN112365724B (en): Continuous intersection signal cooperative control method based on deep reinforcement learning
CN110032782A (en): City-level intelligent traffic signal control system and method
CN112419726B (en): Urban traffic signal control system based on traffic flow prediction
CN113436443B (en): Distributed traffic signal control method based on generative adversarial networks and reinforcement learning
CN115083174B (en): A traffic light control method based on cooperative multi-agent reinforcement learning
CN107862877 (en): Urban traffic signal fuzzy control method
CN118968790A (en): Multi-traffic-signal control method and system based on multi-agent reinforcement learning
CN113112823A (en): Urban road network traffic signal control method based on MPC
CN116895158B (en): A traffic signal control method for urban road networks based on multi-agent Actor-Critic and GRU
CN113487857B (en): Regional multi-intersection variable lane cooperative control decision method
CN117116064A (en): A signal control method for minimizing passenger delays based on deep reinforcement learning
CN118155429A (en): A traffic adaptive control method based on multi-agent reinforcement learning
CN114360290A (en): Method for selecting vehicle-group lanes ahead of an intersection based on reinforcement learning
CN118629228A (en): Traffic signal control method based on multi-objective multi-agent deep reinforcement learning
CN113628455B (en): Intersection signal optimization control method considering vehicle occupancy under an Internet of Vehicles environment
CN118332451B (en): A classification method for group division models
CN117058873B (en): A variable speed limit control method for expressways under digital twin conditions
CN115762128B (en): A deep reinforcement learning traffic signal control method based on the self-attention mechanism
CN116189464B (en): Cross-entropy reinforcement learning variable speed limit control method based on a refined return mechanism
CN114783178B (en): Self-adaptive parking lot exit gateway control method, device, and storage medium
Wei et al.: Intersection signal control approach based on PSO and simulation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
