Intelligent automobile track prediction system and method integrating peripheral automobile interaction information
Technical Field
The invention belongs to the technical field of intelligent driving, and particularly relates to an intelligent automobile track prediction system and method integrating peripheral automobile interaction information.
Background
With the development of intelligent automobile technology and the rise of 5G communication technology, research on the automatic driving of vehicles by scholars at home and abroad has grown rapidly, and one of its main purposes is to reduce traffic accidents. The decision system, as a core part of the automatic driving technology of the vehicle, is required to predict in real time a driving track capable of avoiding surrounding obstacles, and is therefore important for the safe driving of the vehicle. The decision system acts as the automatic driving brain, planning a safe and reasonable optimal track for the intelligent automobile mainly according to the own-vehicle driving information sensed by the vehicle sensors and the information of other traffic participants, such as the positions, speeds and lane lines of surrounding vehicles acquired based on V2X.
The main research direction of the vehicle decision-making system is, given the state of the intelligent vehicle, to predict the future track according to the collected historical track data. The methods used can be divided into two classes: methods based on traditional physical models and methods based on neural network prediction. The first class, based on traditional physical models, generally uses models such as the constant velocity model, the bicycle model and the Kalman filtering model to generate the future track of the predicted vehicle from historical data representing its physical motion; however, such methods rarely consider the influence of surrounding vehicles, and their parameters must be re-tuned for each situation, so real-time performance and accuracy cannot be well guaranteed. The second class, based on neural network prediction, mainly uses the neural network to encode and decode the historical tracks of vehicles and generate future tracks; its effect has been shown to be superior to that of the traditional physical model methods, but the environmental data features are not fully mined, and the interaction information between the vehicle and its surrounding environment is not well utilized.
In fact, an intelligent automobile must share the road with surrounding vehicles while traveling, and its travel trajectory is also affected and constrained by the road environment, e.g., lane geometry, crosswalks, traffic lights, and the behavior of other vehicles. Starting from existing neural network methods, the invention provides an intelligent automobile track prediction method integrating peripheral automobile interaction information, by considering the influence of the running scene of the automobile and the surrounding vehicle environment on track prediction and combining a dynamic graph neural network with a lane graph neural network.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide an intelligent automobile track prediction system and method fusing peripheral automobile interaction information, so as to solve the problem in the prior art that track prediction accuracy suffers because the interaction between the own vehicle and peripheral vehicles is neglected, and to provide a guarantee for the safe and efficient running of the intelligent automobile.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the invention discloses an intelligent automobile track prediction system integrating peripheral vehicle interaction information, which, as shown in fig. 1, comprises four modules: a vehicle interaction relation extraction module, a driving scene characterization module, a space-time relation fusion module and a track prediction module. The vehicle interaction relation extraction module defines an influence threshold of 10 meters over the vehicle historical track data perceived by the sensors, mainly the position coordinates, constructs an interaction graph representing the interaction relation between the vehicle and surrounding vehicles, passes the original track sequence coordinates and the interaction graph into a GCN network, and outputs track data representing the inter-vehicle interaction relation graph.
The driving scene representation module constructs an interaction relation graph between lane segments according to the originally perceived map information M, i.e., it represents the predecessor, successor, left-neighbor and right-neighbor lanes of each lane, and then passes the interaction relation graph together with the original map information M into a lane graph convolution network, outputting map data representing the lane interaction relations.
The space-time relationship fusion module fuses the data of the two preceding modules: it first transmits the track information representing the inter-vehicle interaction relation graph to the map data, so as to grasp traffic congestion and lane usage; it then updates the map data fused with the track information through a lane graph convolution network, realizing real-time interconnection among lane segments and outputting map feature data implicitly containing vehicle information; finally, it feeds the updated real-time map features and the original track information back to the vehicle, so the output implicitly represents historical track information carrying both real-time map interaction and surrounding vehicle interaction.
The track prediction module takes as input the historical track data fused by the space-time relation fusion module, decodes the two-dimensional track coordinates of the vehicle at future moments through encoder and decoder processing, and, by stacking several encoders and decoders together with a classification loss, outputs multiple modes. The final output track coordinates are thus represented as several sets of future track values, representing several possible future tracks for the same vehicle.
Further, the vehicle interaction relation extraction module includes: the vehicle historical tracks X, the constructed vehicle interaction graph G, and a graph convolutional network GCN. The vehicle interaction graph G is constructed by receiving the historical track information X of the vehicle and surrounding vehicles and describing the interaction relations of the vehicles at the temporal and spatial levels in the form of graph matrices; the graph is input into the graph convolutional GCN network together with the vehicle historical tracks X to capture the complex interactions among different traffic vehicles and obtain historical track information carrying interaction information;
Further, the driving scene representation module comprises a high-definition vector map M, an interactive lane graph, a lane graph convolution GCN, and a fully connected layer FC1. Considering the influence of driving scene information (including lane center lines, steering and traffic control) on the track of a target vehicle, the interactive lane graph acquires the lane information of the high-definition vector map M; the lane information and the high-definition vector map M are input into the lane graph convolution GCN network, and the map feature information is extracted through the fully connected layer FC1;
Further, the space-time relation fusion module is divided into three units. Unit one receives the historical track information and the map feature information, introduces real-time vehicle information to the lane nodes through one attention layer and the fully connected layer FC2, acquires the usage condition of each lane, and outputs map data containing the historical track information of the vehicles. Unit two receives the output of unit one, the historical track information and the map feature information, and updates the lane node features by transmitting lane information through a lane graph convolution GCN layer and the fully connected layer FC3. Unit three receives the historical track information and the map feature information, and fuses the real-time traffic information with the features updated by unit two through an attention mechanism and the fully connected layer FC4. The three units form a stack of sequential cyclic fusion blocks that transmit real-time traffic information, acquiring the information flows between vehicles, between lanes, and between vehicles and lanes, and finally output the vehicle track information to the track prediction module;
Further, the trajectory prediction module comprises the encoder GRU and decoder GRU of a Seq2Seq structure and the coordinates of the last observed frame. The encoder GRU first receives the fused feature information from the space-time relation fusion module as input and encodes it along the time dimension; the encoding is then input into the decoder GRU together with the last observed frame coordinates, and the BEV track coordinate values of the future time steps are decoded step by step. A classification branch is used to predict the confidence score of each mode, obtaining K modal tracks of the own vehicle.
The invention also provides an intelligent automobile track prediction method integrating the peripheral vehicle interaction information, which comprises the following steps:
S1: first, preprocess the input, namely the historical tracks of the predicted vehicle and the surrounding vehicles and the interaction graph G between the vehicles; the historical tracks are processed into a three-dimensional array of size n × T_h × c, where n represents the n objects observed in the traffic scene over the past time steps, T_h represents the history horizon, and c = 2 represents the x and y coordinates of each object;
the interaction graph G between vehicles is represented as G = (V, E), where V represents the nodes of the graph, i.e., the observed vehicles, with the feature vector on each node being the coordinates of the object at time t; E represents the interaction edges between vehicles and is represented by adjacency matrices. Considering that vehicles are connected along both spatio-temporal dimensions, namely edges of mutual interaction between different vehicles generated by spatial proximity, and edges connecting each vehicle with its own historical moments in the time domain, the interaction graph G is expressed as a pair of adjacency matrices:
G = {A_0, A_1}
where A_0 is the temporal-edge adjacency matrix and A_1 is the spatial-edge adjacency matrix;
S2: mapping the history track and the interaction graph G to a high-dimensional convolution layer through a two-dimensional convolution layer; then carrying out space-time interaction through two layers of picture scroll lamination; the convolution kernel of the space-time interaction comprises two parts, namely an interaction graph G of a current observation frame and a trainable graph Gtrain with the same size as G; extracting space interaction information by using a convolution network with a convolution kernel as a sum of G and Gtrain, and processing data by using a time convolution layer with the size fixed as (1 multiplied by 3) through the convolution kernel on a time layer, so that data in the dimension of n multiplied byh multiplied by c are processed along the time dimension, and after the space layer and the time layer are alternately processed, outputting track data of an inter-vehicle interaction relation diagram with the dimension of n multiplied byh multiplied by c;
S3: extracting features according to the map data to obtain a structured map representation from the vectorized map data;
S3.1: first, construct a lane graph according to the map data: the obtained lane center line L_cen is expressed as a series of two-dimensional bird's-eye-view coordinate points; the information of any two connected lanes, namely the left-neighbor, right-neighbor, predecessor and successor lanes, is obtained and processed into four connectivity dictionaries keyed by lane id, respectively representing the predecessor lane L_pre, the successor lane L_suc, the left-neighbor lane L_left and the right-neighbor lane L_right of a given lane L, thereby obtaining the lane graph;
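The sketch below illustrates how the four connectivity dictionaries of S3.1 could be assembled; the input field names (lane_id, predecessors, successors, left_neighbors, right_neighbors) are hypothetical placeholders for whatever the map source actually provides:

```python
from typing import Dict, List

def build_lane_graph(lanes: List[dict]) -> Dict[str, Dict[int, List[int]]]:
    """Collect the four connectivity dictionaries, each keyed by lane id."""
    graph = {"pre": {}, "suc": {}, "left": {}, "right": {}}
    for lane in lanes:
        lid = lane["lane_id"]
        graph["pre"][lid] = lane.get("predecessors", [])
        graph["suc"][lid] = lane.get("successors", [])
        graph["left"][lid] = lane.get("left_neighbors", [])
        graph["right"][lid] = lane.get("right_neighbors", [])
    return graph

lanes = [
    {"lane_id": 0, "successors": [1], "left_neighbors": [2]},
    {"lane_id": 1, "predecessors": [0]},
    {"lane_id": 2, "right_neighbors": [0]},
]
print(build_lane_graph(lanes)["suc"])   # {0: [1], 1: [], 2: []}
```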
S3.2: features in the lane map and map data are then added, the features including: the method comprises the steps of inputting a lane serial number lid, a lane central line sequence point lcen, a lane steering condition lturn, whether a lane has traffic control lcon or not, whether the lane is an intersection linter or not, inputting the lane serial number lid and the lane central line sequence point lcen into a lane graph convolution GCN network together, and outputting map data containing lane interaction relations;
S4: fusing the track data of the inter-vehicle interaction relation diagram output in the step S2 with the map data containing the lane interaction relation output in the step S3.2, wherein the method comprises the following steps:
(1) Fusing the vehicle information to the lane nodes, and grasping the traffic congestion condition;
(2) Fusing and updating information between lane nodes to achieve real-time interconnection between lane segments;
(3) Fusing and feeding back map data characteristics and real-time traffic information to the vehicle;
The information update between lane nodes in part (2) adopts the lane graph convolution GCN: a graph convolution is constructed using the adjacency matrices carrying lane information to extract the lane interaction information;
The mutual transmission between vehicle information and lane information, i.e., parts (1) and (3), extracts the interaction features of three types of information, namely the input lane features, the vehicle features and the context node information, through a spatial attention mechanism; a context node is defined as a lane node whose l2 distance to a vehicle node is smaller than a threshold;
the network of part (1) is arranged as follows: the n × 128 two-dimensional lane position information and the n × 4 lane property features form the new map feature information, which together with the two-dimensional vehicle feature data serves as the unit input; after stacking two graph attention layers and one fully connected layer, the lane features carrying vehicle information are output, keeping the dimension n × 128; the lane property features comprise whether the lane turns, whether it has traffic control, and whether it is an intersection;
The network of part (3) is set up in the same way as part (1); it finally outputs the vehicle feature information containing lane information and lane interaction information, keeping the output dimension n × 128;
S5: output the final motion track prediction according to the vehicle feature information fused in S4; specifically:
For each vehicle agent, K possible future trajectories and their corresponding confidence scores are predicted; the prediction comprises two branches: a regression branch predicts the trajectory of each mode, and a classification branch predicts the confidence score of each mode. For the n-th participant, the K sequences of BEV coordinates are regressed in the regression branch using a Seq2Seq structure, as follows: first, the fused vehicle features are expanded to n × T_h × c and input into the Seq2Seq network, and the vectors representing the vehicle features are fed to the corresponding input units of the encoder; the hidden features of the encoder are then fed to the decoder together with the coordinates of the vehicle at the previous time step to predict the position coordinates of the current time step. In particular, the coordinates of the vehicle at the last historical time step serve as the input to the first decoding step, and the output of each step is fed to the next decoder unit; this decoding process is repeated until the model has predicted the position coordinates of all expected future time steps.
Further, in S2 the graph convolution is defined as Y = LXW, where X ∈ R^(N×F) represents the node features, W ∈ R^(F×O) represents the weight matrix, Y ∈ R^(N×O) represents the output, N represents the total number of input nodes, F the feature number of the input nodes, and O the feature number of the output nodes; the graph Laplacian matrix L ∈ R^(N×N) is expressed as:

L = D^(-1/2) (I + A) D^(-1/2)

where I, A and D are the identity matrix, the adjacency matrix and the degree matrix, respectively; I and A represent the self-connections and the connections between different nodes, all connections share the same weight W, and the degree matrix D is used to normalize the data.
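A plain NumPy illustration of this graph convolution, under the Laplacian form reconstructed above (a sketch, not the invention's code):

```python
import numpy as np

def graph_conv(X: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Y = L X W with L = D^{-1/2}(I + A)D^{-1/2}."""
    A_hat = A + np.eye(len(A))               # I + A: add self-connections
    d = A_hat.sum(axis=1)                    # node degrees (diagonal of D)
    L = A_hat / np.sqrt(np.outer(d, d))      # symmetric degree normalization
    return L @ X @ W

X = np.random.randn(5, 2)                    # N = 5 nodes, F = 2 input features
A = np.zeros((5, 5)); A[0, 1] = A[1, 0] = 1  # one spatial interaction edge
W = np.random.randn(2, 8)                    # O = 8 output features
print(graph_conv(X, A, W).shape)             # (5, 8)
```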
Further, before performing the graph convolution in S2, the interaction graph G is normalized:

A_j ← D_j^(-1/2) A_j D_j^(-1/2)

where A refers to the adjacency matrix, D to the degree matrix, and j indexes the data sequence: A_j is the adjacency matrix constructed for the j-th data sequence, and D_j is the degree matrix of the j-th data sequence, computed as:

D_j^(ii) = Σ_k A_j^(ik) + α

The degree matrix D_j is a diagonal matrix that counts, for each node i, the nodes adjacent to it among the k nodes; α is set to 0.001 to avoid empty rows in A_j.
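A minimal sketch of this normalization, assuming the symmetric form given above with α padding the degrees:

```python
import numpy as np

def normalize_adjacency(A_j: np.ndarray, alpha: float = 0.001) -> np.ndarray:
    """Symmetric normalization D_j^{-1/2} A_j D_j^{-1/2} for one data sequence."""
    deg = A_j.sum(axis=1) + alpha            # alpha = 0.001 keeps rows non-empty
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return D_inv_sqrt @ A_j @ D_inv_sqrt

A_1 = np.array([[0.0, 1.0], [1.0, 0.0]])     # two interacting vehicles
print(normalize_adjacency(A_1).round(3))     # off-diagonal entries near 1
```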
Further, the lane graph convolution GCN network in S3.2 is expressed as:

Y = XW + Σ_i A_i X W_i,  i ∈ {predecessor, successor, left-neighbor, right-neighbor}

where A_i and W_i refer to the adjacency matrix and the weight matrix corresponding to the i-th lane connection type, and X refers to the node feature matrix. The corresponding node feature x_i is the i-th row of the node feature matrix X and represents the input feature of the i-th lane node, comprising the shape feature and the position feature of the lane, namely:

x_i = [v_i^end − v_i^start, v_i],  v_i = (v_i^start + v_i^end) / 2

where v_i refers to the location of the i-th lane node, i.e. the midpoint between the two end points of the lane segment, and v_i^start and v_i^end refer respectively to the start and end position coordinates of the i-th lane segment.
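A small worked example of this node feature, assuming it is the concatenation of the shape vector and the midpoint location:

```python
import numpy as np

def lane_node_feature(start: np.ndarray, end: np.ndarray) -> np.ndarray:
    v_i = (start + end) / 2.0       # node location: midpoint of the segment
    shape = end - start             # segment shape: direction and length
    return np.concatenate([shape, v_i])

print(lane_node_feature(np.array([0.0, 0.0]), np.array([4.0, 2.0])))
# [4. 2. 2. 1.]
```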
The invention has the beneficial effects that:
(1) The invention provides a graph convolutional neural network considering surrounding vehicle interaction, solving the problem that existing track prediction algorithms do not consider the information interaction of surrounding vehicles.
(2) The invention provides a method for extracting map information from a high-definition vector map instead of a bird's-eye view; the vector map defines the geometric shape of the lanes, reducing the prediction discretization caused by resolution limits.
(3) The invention provides a method for fusing the space-time relationship between the vehicle and the driving scene, and introduces new lane characteristics to represent the generalized geometric relationship between the vehicle and the lanes, thereby effectively improving the accuracy of track prediction when facing lanes with different shapes and numbers.
(4) The invention provides a stacked multi-Seq2Seq structure for predicting the future multi-modal tracks of a vehicle and the probabilities of selecting different lanes, overcoming the limitation of a single track output.
Drawings
FIG. 1 is a schematic diagram of the predictive model architecture of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
(I) Modeling analysis of the trajectory prediction problem
The trajectory prediction problem can be expressed as the problem of predicting the trajectory of a vehicle in a future scene based on the historical trajectory information of all objects. Specifically, the input to the model is the historical track X of all observed objects over the history horizon T_h:

X = [X^1, X^2, …, X^(T_h)],  X^t = [(x_1^t, y_1^t), (x_2^t, y_2^t), …, (x_n^t, y_n^t)]

where (x_i^t, y_i^t) denotes the lateral and longitudinal positions of the i-th of the n observed vehicles at time t.
Further, in consideration of the influence of the static environment around the vehicle on its running, the invention simultaneously takes into account the map lane information within the scene; therefore, in addition to the historical tracks of the vehicles, the input also includes the map data M of the scene:

M = [l_id, l_cen, l_turn, l_con, l_inter]

where l_id denotes the lane id, l_cen the lane center line sequence points, l_turn the lane steering condition, l_con whether the lane has traffic control, and l_inter whether the lane is an intersection.
After model training, the future coordinate sequence Y from T_h + 1 to T_h + T_f is output:

Y = [Y^(T_h + 1), …, Y^(T_h + T_f)]
The raw data need to be preprocessed before being input into the model. First, the predicted vehicle and the surrounding vehicles in the traffic scene are sampled at a frequency of 10 Hz to obtain the position coordinates of all sampling points, i.e., the lateral and longitudinal coordinates of each vehicle. The coordinates of the predicted vehicle are set to (0, 0), and the coordinates of the surrounding vehicles are converted into relative coordinates with the predicted vehicle as the origin, which enhances the generalization and robustness of the model. The track information of the next 3 s is then predicted from the track information of the previous 2 s, which serves as the history data.
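A hedged sketch of this preprocessing; the choice of the last observed position of the predicted vehicle as the origin is an assumption consistent with the text:

```python
import numpy as np

def preprocess(tracks: np.ndarray, target_idx: int, t_h: int = 20, t_f: int = 30):
    """tracks: (n, t_h + t_f, 2) absolute (x, y) positions sampled at 10 Hz."""
    origin = tracks[target_idx, t_h - 1]          # last observed target position
    rel = tracks - origin                         # predicted vehicle at (0, 0)
    return rel[:, :t_h], rel[target_idx, t_h:]    # 2 s history X, 3 s future Y

tracks = np.cumsum(np.random.randn(4, 50, 2), axis=1)  # 4 vehicles, 5 s at 10 Hz
X, Y = preprocess(tracks, target_idx=0)
print(X.shape, Y.shape)   # (4, 20, 2) (30, 2)
```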
(II) Designing the model to implement trajectory prediction
As shown in fig. 1, a track prediction model for fusing surrounding vehicle interaction information according to the present invention includes: the system comprises a vehicle interaction relation extraction module, a driving scene characterization module, a space-time relation fusion module and a track prediction module. The track prediction method by using the model comprises the following steps:
The input of the vehicle interaction relation extraction module comprises two parts, namely a history track of the predicted vehicle and surrounding vehicles and an interaction graph G between the vehicles.
The input is first preprocessed: the historical tracks are processed into a three-dimensional array of size n × T_h × c, where n represents the n objects observed in the traffic scene over the past time steps, T_h denotes the history horizon, and c = 2 represents the x and y coordinates of each object.
The interaction graph G between vehicles is represented as G = (V, E), where V represents the nodes of the graph, i.e., the observed vehicles, with the feature vector on each node being the coordinates of the object at time t; E represents the interaction edges between vehicles and is represented by adjacency matrices at model input. Considering that vehicles are connected along both spatio-temporal dimensions, namely edges of mutual interaction between different vehicles generated by spatial proximity, and edges connecting each vehicle with its own historical moments in the time domain, the interaction graph G is expressed as:
G = {A_0, A_1}
where A_0 is the temporal-edge adjacency matrix and A_1 is the spatial-edge adjacency matrix.
In the data processing, considering all tracks at the last history frame, a 10-meter area centered on the predicted vehicle (an empirical value) is defined as the influence radius r; the distance l between each peripheral vehicle and the predicted vehicle is calculated, and when l ≤ r the two vehicles are considered to interact, so the corresponding adjacency matrix entry is set to 1, and otherwise to 0. The spatial matrix A_1 at the current observation time is thus constructed, and A_0 is taken as the identity matrix I in the vehicle time domain.
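The construction of A_0 and A_1 can be sketched as follows; function and variable names are illustrative:

```python
import numpy as np

def build_interaction_graph(pos: np.ndarray, target_idx: int = 0, r: float = 10.0):
    """pos: (n, 2) vehicle positions at the last history frame."""
    n = len(pos)
    A1 = np.zeros((n, n))
    dist = np.linalg.norm(pos - pos[target_idx], axis=1)   # l <= r => interaction
    for i in np.flatnonzero(dist <= r):
        if i != target_idx:
            A1[target_idx, i] = A1[i, target_idx] = 1.0    # spatial edge
    A0 = np.eye(n)                                         # temporal self-edges
    return A0, A1

pos = np.array([[0.0, 0.0], [3.0, 4.0], [30.0, 0.0]])
A0, A1 = build_interaction_graph(pos)
print(A1)   # only the vehicle 5 m away is connected to the predicted vehicle
```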
After data processing, the historical tracks and the interaction graph G are passed into the vehicle interaction relation extraction module and mapped to a high-dimensional space through one 2D convolution layer; spatio-temporal interaction is then performed through two graph convolution layers. Considering the time-varying nature of the spatial interaction, the convolution kernel of the spatial interaction consists of the sum of two parts, namely the interaction graph G of the current observed frame and a trainable graph G_train that matches the size of G and participates in training.
The graph convolution is defined as Y = LXW, where X ∈ R^(N×F) represents the node feature matrix, W ∈ R^(F×O) represents the weight matrix, and Y ∈ R^(N×O) represents the output (N is the total number of input nodes, F the feature number of the input nodes, and O the feature number of the output nodes); the graph Laplacian matrix L ∈ R^(N×N) is expressed as:

L = D^(-1/2) (I + A) D^(-1/2)

where I, A and D are the identity matrix, the adjacency matrix and the degree matrix, respectively. I and A represent the self-connections and the connections between different nodes. All connections share the same weight W, and the degree matrix D is used to normalize the data.
Therefore, before performing the graph convolution operation, in order to keep the value range of the graph elements unchanged after the graph operation, the invention normalizes the interaction graph G using:

A_j ← D_j^(-1/2) A_j D_j^(-1/2)

where A refers to the adjacency matrix, D to the degree matrix, and j indexes the data sequence: A_j is the adjacency matrix constructed for the j-th data sequence, and D_j is the corresponding degree matrix, computed as:

D_j^(ii) = Σ_k A_j^(ik) + α

The degree matrix D_j is a diagonal matrix that counts, for each node i, the nodes adjacent to it among the k nodes; α is set to 0.001 to avoid empty rows in A_j.
In this way, the spatial interaction information is extracted through the graph convolution network with kernel G + G_train, and the data of dimension n × T_h × c are then processed along the time dimension (the second dimension) by a temporal convolution layer whose kernel is fixed at (1 × 3). After the spatial and temporal layers are applied alternately, data of unchanged dimension n × T_h × c are output, which are subsequently fused with the output data of the driving scene representation module.
The driving scene representation module extracts features from the input map data M and learns a structured map representation from the vectorized map data. Before entering the module, a lane graph must first be built from the map data M: from the obtained lane center line l_cen (expressed as a series of two-dimensional bird's-eye-view coordinate points), any two connected lanes, namely the left-neighbor, right-neighbor, predecessor and successor lanes, can be obtained. These data are processed into four connectivity dictionaries keyed by lane id, respectively representing the predecessor lane L_pre, the successor lane L_suc, the left-neighbor lane L_left and the right-neighbor lane L_right of a given lane L, thereby obtaining the lane graph. The interactive lane graph is then input into the lane graph convolution GCN network together with the other features in the map data M, including the lane id l_id, the lane center line sequence points l_cen, the lane steering condition l_turn, whether the lane has traffic control l_con, and whether the lane is an intersection l_inter. The invention deforms the conventional graph convolution to obtain the lane graph convolution GCN network, which is expressed as:
Y = XW + Σ_i A_i X W_i,  i ∈ {predecessor, successor, left-neighbor, right-neighbor}

where A_i and W_i refer to the adjacency matrix and the weight matrix corresponding to the i-th lane connection type, and X refers to the node feature matrix. The corresponding node feature x_i is the i-th row of the node feature matrix X and represents the input feature of the i-th lane node, comprising the shape feature and the position feature of the lane, namely:

x_i = [v_i^end − v_i^start, v_i],  v_i = (v_i^start + v_i^end) / 2

where v_i refers to the location of the i-th lane node, i.e. the midpoint between the two end points of the lane segment, and v_i^start and v_i^end refer respectively to the start and end position coordinates of the i-th lane segment.
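A hedged PyTorch sketch of this lane graph convolution, with one weight matrix per connection type as in the formula reconstructed above (the module name LaneGraphConv is illustrative):

```python
import torch
import torch.nn as nn

class LaneGraphConv(nn.Module):
    """Y = X W + sum_i A_i X W_i over the four lane connection types."""
    def __init__(self, in_dim: int, out_dim: int, n_relations: int = 4):
        super().__init__()
        self.self_fc = nn.Linear(in_dim, out_dim, bias=False)     # W
        self.rel_fc = nn.ModuleList(                              # W_i per type
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(n_relations)])

    def forward(self, X, A):
        # A: list of adjacency matrices [A_pre, A_suc, A_left, A_right]
        Y = self.self_fc(X)
        for A_i, fc in zip(A, self.rel_fc):
            Y = Y + A_i @ fc(X)      # aggregate neighbors of connection type i
        return Y

X = torch.randn(3, 4)                # 3 lane nodes, 4-dim input features
A_suc = torch.tensor([[0., 1., 0.], [0., 0., 0.], [0., 0., 0.]])  # edge 0 -> 1
A = [torch.zeros(3, 3), A_suc, torch.zeros(3, 3), torch.zeros(3, 3)]
print(LaneGraphConv(4, 16)(X, A).shape)   # torch.Size([3, 16])
```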
Considering that, within a fixed history period, a vehicle traveling too fast produces a long historical track segment, which usually occurs in straight lane segments, a dilated convolution can be adopted in straight segments, enlarging the adjacency matrix to expand the receptive field. After the convolution network, the lane features are output through one fully connected layer of dimension n × 128.
The space-time relationship fusion module mainly fuses the features output by the vehicle interaction relation extraction module and the driving scene representation module, realizing in sequence: (1) transmitting the vehicle information to the lane nodes, grasping lane congestion and other usage conditions; (2) updating the information between lane nodes to realize real-time interconnection between lane segments; (3) fusing the updated map features and the real-time traffic information and feeding them back to the vehicle. The information update between lane nodes in part (2) again adopts the lane graph convolution GCN: a graph convolution is constructed with the adjacency matrices carrying lane information to extract the lane interaction information. The mutual transmission between vehicle information and lane information, i.e., parts (1) and (3), extracts the interaction features of three types of information, namely the input lane features, the vehicle features and the context node information, through a spatial attention mechanism. Here a context node is defined as a lane node whose l2 distance to a vehicle node is smaller than a threshold, where the threshold may take an empirical value of 6 meters. The network of part (1) is arranged as follows: the n × 128 two-dimensional lane position information and the n × 4 lane property features (whether the lane turns, has traffic control, or is an intersection) extracted by the driving scene representation module form the new map feature information, which together with the two-dimensional vehicle feature data serves as the unit input; after stacking two graph attention layers and one fully connected layer, the lane features carrying vehicle information are output with dimension n × 128. The network structure of part (3) is consistent with that of part (1); it finally extracts the vehicle feature information containing lane information and lane interaction information, keeping the output dimension n × 128.
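One vehicle-to-lane fusion unit of part (1) might be sketched as below, assuming a standard attention layer restricted to context nodes within the 6-meter threshold; the residual update and the module name are assumptions:

```python
import torch
import torch.nn as nn

class VehicleToLaneAttention(nn.Module):
    """Lane nodes attend over vehicle nodes within the context radius."""
    def __init__(self, dim: int = 128, radius: float = 6.0):
        super().__init__()
        self.radius = radius
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, lane_feat, lane_pos, veh_feat, veh_pos):
        # block vehicles beyond the 6 m context-node threshold (True = blocked)
        mask = torch.cdist(lane_pos, veh_pos) > self.radius
        out, _ = self.attn(lane_feat.unsqueeze(0), veh_feat.unsqueeze(0),
                           veh_feat.unsqueeze(0), attn_mask=mask.unsqueeze(0))
        return self.fc(out.squeeze(0)) + lane_feat     # residual lane update

lane_feat, veh_feat = torch.randn(5, 128), torch.randn(3, 128)
lane_pos, veh_pos = torch.rand(5, 2) * 4, torch.rand(3, 2) * 4   # all within 6 m
print(VehicleToLaneAttention()(lane_feat, lane_pos, veh_feat, veh_pos).shape)
# torch.Size([5, 128])
```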
The track prediction module takes the fused vehicle feature information as input, and a multi-modal prediction head outputs the final motion track prediction. For each vehicle agent, K possible future trajectories and corresponding confidence scores are predicted. The prediction module thus has two branches: a regression branch predicts the trajectory of each mode, and a classification branch predicts the confidence score of each mode. For the n-th participant, the K sequences of BEV coordinates are regressed using a Seq2Seq structure in the regression branch. The specific process is as follows: first, the fused vehicle features are expanded to n × T_h × c and input into the Seq2Seq network, and the vectors representing the vehicle features are fed to the corresponding input units of the encoder GRU (one per time step); the hidden features of the encoder GRU, together with the coordinates of the vehicle at the previous time step, are then fed to the decoder GRU to predict the position coordinates of the current time step. Specifically, the input to the first decoding step is the coordinates of the vehicle at the last historical moment, and the output of each step is fed to the next GRU unit. This decoding process is repeated until the model has predicted the position coordinates of all expected future time steps.
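A minimal sketch of the Seq2Seq regression branch described above, with one mode shown; the hidden size and the module name are assumptions:

```python
import torch
import torch.nn as nn

class Seq2SeqPredictor(nn.Module):
    """GRU encoder over fused features; GRU decoder fed back its own outputs."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64, t_f: int = 30):
        super().__init__()
        self.t_f = t_f
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(2, hidden)
        self.out = nn.Linear(hidden, 2)          # BEV (x, y) per future step

    def forward(self, feats: torch.Tensor, last_pos: torch.Tensor):
        """feats: (n, T_h, feat_dim); last_pos: (n, 2) last historical coords."""
        _, h = self.encoder(feats)               # hidden features of the encoder
        h, pos, preds = h.squeeze(0), last_pos, []
        for _ in range(self.t_f):
            h = self.decoder(pos, h)             # decode one future time step
            pos = self.out(h)                    # output fed to the next step
            preds.append(pos)
        return torch.stack(preds, dim=1)         # (n, t_f, 2)

pred = Seq2SeqPredictor()(torch.randn(4, 20, 128), torch.zeros(4, 2))
print(pred.shape)   # torch.Size([4, 30, 2]) - 3 s of future positions at 10 Hz
```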
(III) Model training
The invention collects real vehicle data in a continuous time period in a track prediction implementation scene as a data set of model training, and a training set, a verification set and a test set used for model training are all taken from the data set.
The invention trains the model using the PyTorch framework; the model uses the Adam optimizer to accelerate learning, with the learning rate of the Adam optimizer set to 0.001, so that training can locate the global optimum more accurately. The loss function consists of the sum of a lane classification error and a track regression error, where the lane classification loss adopts a binary-classification margin loss and the track regression loss adopts the root mean square error (RMSE) loss. The evaluation uses FDE, the l2 distance between the end point of the best predicted trajectory and the ground truth, and ADE, the average l2 distance between the best predicted trajectory and the ground truth.
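The losses and metrics named here can be illustrated as follows; the best-mode selection rule and the exact loss composition are assumptions:

```python
import torch

def rmse_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Per-mode RMSE between predicted and ground-truth trajectories."""
    return torch.sqrt(((pred - gt) ** 2).sum(-1).mean(-1))   # (K,)

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """ADE/FDE of the best of K predicted trajectories."""
    err = torch.linalg.norm(pred - gt, dim=-1)    # (K, t_f) pointwise l2 error
    best = err.mean(-1).argmin()                  # mode closest on average
    return err[best].mean().item(), err[best, -1].item()

pred = torch.randn(6, 30, 2)                      # K = 6 modes, 3 s at 10 Hz
gt = torch.randn(30, 2)
print("regression loss:", rmse_loss(pred, gt).min().item())
ade, fde = ade_fde(pred, gt)
print(f"ADE={ade:.2f}  FDE={fde:.2f}")
```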
The number of training rounds is adjusted in real time according to actual demands and training effects, and the model parameter file is saved after each training round.
The above list of detailed descriptions is only specific to practical embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent manners or modifications that do not depart from the technical scope of the present invention should be included in the scope of the present invention.