Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a driving dangerous scene identification method based on a lightweight multi-modal neural network, which is used for testing automatic driving algorithms and improves the accuracy of automatic driving tests.
The aim of the invention can be achieved by the following technical scheme:
A driving dangerous scene identification method based on a lightweight multi-modal neural network, characterized by comprising the following steps:
 S1, acquiring driving video and vehicle-mounted data in a current time period;
S2, dividing the picture of the driving video into three vertically stacked driving areas, averaging the image within each driving area of each video frame along the vertical direction to convert it into a single row of pixels, and then splicing the rows of pixels of successive frames together in time order to form a motion profile of each driving area;
S3, inputting the motion profile of each driving area and the vehicle-mounted data into a driving risk assessment model to obtain an identification result;
The driving risk assessment model is a multi-modal neural network comprising a visual data processing layer, a kinematic data processing layer, a data fusion layer and a prediction layer; the visual data processing layer is a lightweight CNN network based on an AlexNet structure improved with an attention mechanism, and outputs visual features after the motion profiles are input into it; the kinematic data processing layer is an LSTM network, and outputs kinematic features after the vehicle-mounted data are input into it; the data fusion layer is a fully connected layer, and outputs the identification result after the visual features and the kinematic features are input into it.
Further, the step S2 specifically includes:
S21, dividing the original video picture into three driving areas according to the camera position and the distance from the vehicle, each area being delimited by an upper boundary and a lower boundary;
S22, based on the driving video clip in the current time period [t_a, t_b], sampling each driving area obtained in step S21, and obtaining the RGB pixel values within the rectangular range of [y_l, y_u] longitudinally and [0, w] transversely in each frame of picture, wherein w is the video width, y_l is the sampling lower boundary, and y_u is the sampling upper boundary;
S23, for each of the R, G and B channels of the image in the rectangular range, taking the pixel mean value in the vertical direction, thereby compressing a (w × (y_u − y_l)) matrix into a (w × 1) matrix, and then superposing the results of the three channels to obtain a (w × 3) pixel matrix, i.e. one row of pixels, corresponding to each frame;
S24, splicing the rows of pixels obtained from the successive frames together in time order to form a (fps × (t_b − t_a), w, 3) matrix, from which a colour motion profile is generated, wherein fps is the number of frames per second of the video.
Further, in step S3, the lightweight CNN network introduces an attention mechanism module after each convolution layer, applies channel attention and spatial attention transformations to the feature map, and reconstructs a new feature map, where the calculation formulas of the channel attention and the spatial attention are as follows:
Attention_c = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
Attention_s = σ(Conv([AvgPool(F), MaxPool(F)]))
wherein Attention_c and Attention_s denote the results of channel attention and spatial attention respectively, F denotes the feature map output by a given convolution layer, σ denotes the Sigmoid function, MLP denotes a multi-layer perceptron network, and Conv denotes a convolution layer whose number of output channels is 1.
Further, the output-side training set of the driving risk assessment model comprises a normal event set and a high-risk event set, and the acquisition method comprises the following steps:
A1, collecting historical vehicle-mounted data;
A2, detecting and filtering abnormal values in the historical vehicle-mounted data using the 3σ principle of the normal distribution, and treating the abnormal values as missing values;
A3, filling the missing values in the historical vehicle-mounted data by linear interpolation to obtain complete vehicle-mounted data;
A4, extracting the vehicle acceleration data a from the complete vehicle-mounted data, plotting and examining their distribution curve, and determining the acceleration threshold of obvious deceleration behaviour, denoted TH_d;
A5, traversing all vehicle acceleration data in time order, collecting the emergency braking moments t_d satisfying the acceleration condition a ≤ TH_d; for each moment t_d, taking the time segment from d_1 seconds before to d_2 seconds after it to form a potential high-risk event segment e_c; eliminating, with the aid of video verification, false alarms caused by data collection errors; and forming the remaining high-risk event segments into a high-risk event set;
A6, randomly sampling a plurality of normal non-conflict events from the vehicle acceleration data remaining after step A5, using |d_1 + d_2| as the time window, to serve as a normal event set.
Further, in step A2, each non-empty kinematic characteristic variable of a historical vehicle-mounted data record is subjected to a condition judgment, and a value satisfying the condition is regarded as an abnormal value; the expression of the condition judgment is:
|x − μ| > 3σ
where x is the non-empty kinematic characteristic variable, μ is the mean value of x and σ is the standard deviation of x.
Further, in the step A3, the linear interpolation method has a calculation expression as follows:
d_i = d_{i−1} + (d_{i+1} − d_{i−1}) × (t_i − t_{i−1}) / (t_{i+1} − t_{i−1})
wherein d_i is the missing value, d_{i−1} is the last non-empty nearest neighbour of the missing value, d_{i+1} is the next non-empty nearest neighbour of the missing value, n is the total number of records, and t_{i−1}, t_i, t_{i+1} are the time instants corresponding to d_{i−1}, d_i, d_{i+1}.
Further, the input-side training set of the driving risk assessment model comprises a CNN network training set, and the acquisition method comprises the following steps:
 acquiring a historical driving video;
Dividing the picture of the historical driving video into three vertically stacked driving areas, averaging the image within each driving area of each video frame along the vertical direction to convert it into a single row of pixels, and then splicing the rows of pixels of successive frames together in time order to form a motion profile, wherein all motion profiles form the CNN network training set.
Further, the CNN network training set is expanded through data enhancement processing, wherein the data enhancement processing comprises randomly transforming the brightness, contrast, saturation and hue of the motion profile, and flipping the motion profile horizontally with a certain probability.
Compared with the prior art, the invention has the following beneficial effects:
The invention first divides the driving video picture into regions and generates a motion profile for each region, compressing the data while retaining the image characteristics. Second, a multi-modal neural network with a visual data processing layer, a kinematic data processing layer, a data fusion layer and a prediction layer is designed as the driving risk assessment model: a lightweight CNN network is adopted to reduce the amount of computation, and an attention mechanism is introduced to improve the classification performance of the model; meanwhile, the LSTM network in the driving risk assessment model adds the extraction of kinematic features to the identification process, effectively improving the accuracy of model prediction. In summary, the method can effectively extract video data, reduce the amount of data to be processed and simplify the model calculation process, and has low time consumption, high accuracy and good practical application value.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
As shown in fig. 1, the present embodiment provides a driving hazard scene identification method based on a lightweight multi-modal neural network, which includes the following steps:
 and S1, acquiring driving video and vehicle-mounted data in the current time period.
And S2, dividing the picture of the driving video into three vertically stacked driving areas, averaging the image within each driving area of each video frame along the vertical direction to convert it into a single row of pixels, and splicing the rows of pixels of successive frames together in time order to form a motion profile of each driving area.
And S3, inputting the motion profile graph and the vehicle-mounted data of each driving area into a driving risk assessment model to obtain an identification result. The driving risk assessment model is a multi-modal neural network comprising a visual data processing layer, a kinematic data processing layer, a data fusion layer and a prediction layer:
The visual data processing layer is a lightweight CNN network obtained by simplifying the network structure of AlexNet and improving it with an attention mechanism, and is used for outputting visual features after the motion profiles are input into the lightweight CNN network;
the kinematic data processing layer is an LSTM network and is used for outputting kinematic features after the vehicle-mounted data are input into the LSTM network;
the data fusion layer is a fully connected layer and is used for outputting the identification result after the visual features and the kinematic features are input into it.
The steps can be specifically described by adopting the following six parts:
 1. Motion profile generation algorithm:
1) For each section of forward driving video, as shown in fig. 2, three driving areas are divided from the original video according to the camera position and the distance from the vehicle, and each area is delimited by an upper boundary and a lower boundary, as shown in the upper half of fig. 3.
2) Each region obtained in step 1) is sampled based on the driving video segment in the time period [t_a, t_b]. Let fps be the number of frames per second of the video, w the video width, y_l the sampling lower boundary and y_u the sampling upper boundary. The samples of each driving area are processed to finally obtain a motion profile with a length of fps × (t_b − t_a) and a width of w. The specific procedure is as follows:
I) The RGB pixel values within the rectangular range of [y_l, y_u] longitudinally and [0, w] transversely are acquired from each frame of picture, i.e. a (y_u − y_l, w, 3) three-dimensional integer matrix (where '3' denotes the three RGB channels);
II) for each of the RGB channels within this range, the mean of the pixels in the longitudinal direction is taken as the pixel value of a point, i.e. the mean is taken over the first dimension of the (y_u − y_l, w, 3) matrix and arranged into one row of 1 × w pixels, i.e. a (1, w, 3) matrix;
III) the rows of pixels obtained from the successive frames are spliced together in time order to form a (fps × (t_b − t_a), w, 3) matrix, from which a colour motion profile is generated; the lower half of fig. 3 shows the motion profile obtained from the middle-distance driving area.
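As an illustration of steps I) to III), a minimal Python sketch using OpenCV and NumPy is given below; the clip name and the three pairs of region boundaries are hypothetical values chosen for the example, not fixed by the method.

```python
import cv2
import numpy as np

def motion_profile(video_path, y_lower, y_upper):
    """Build one motion profile for the driving area bounded by [y_lower, y_upper)."""
    cap = cv2.VideoCapture(video_path)
    rows = []
    while True:
        ok, frame = cap.read()                      # frame: (height, w, 3), BGR order
        if not ok:
            break
        strip = frame[y_lower:y_upper, :, :]        # (y_upper - y_lower, w, 3)
        row = strip.mean(axis=0, keepdims=True)     # vertical average -> (1, w, 3)
        rows.append(row)
    cap.release()
    # stack the rows in temporal order -> (fps * (t_b - t_a), w, 3)
    return np.concatenate(rows, axis=0).astype(np.uint8)

# hypothetical near / middle / far boundaries for a 1280 x 720 forward video
REGIONS = [(480, 720), (300, 480), (200, 300)]
profiles = [motion_profile("drive_clip.mp4", yl, yu) for yl, yu in REGIONS]
```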
2. Constructing a lightweight CNN network incorporating an attention mechanism
The lightweight CNN network is based on the AlexNet network structure: a lightweight convolutional neural network is constructed comprising convolution layers for extracting local features of the visual data and fully connected layers for global feature processing, and an attention mechanism is introduced to improve the model so that it focuses on the key positions of the motion profile. When the motion profiles are input into the lightweight CNN network, the three motion profiles of the three driving areas are first converted into matrices m_near, m_mid and m_far respectively, and the three matrices are combined along the image channel dimension to obtain a nine-channel matrix m_1 as the input.
The detailed construction flow of the lightweight CNN network is as follows:
1) First, an input layer is constructed; for example, a 224 × 224 pixel profile is converted into a (224, 224, 9) matrix;
2) m_1 is passed through the Conv1 layer, whose parameters mainly comprise the number, size and stride of the filters and the activation function; for example, 16 filters of 11 × 11 are used for convolution with a stride of 4, and the matrix m_2 is obtained after the ReLU activation function;
3) m_2 is passed through the Pool1 layer, whose parameters mainly comprise the filter size, the pooling type and the stride; for example, max pooling is performed with a 3 × 3 filter and a stride of 2 to obtain the matrix m_3;
4) Similarly, m_3 is passed through the Conv2 layer; for example, 32 filters of 5 × 5 are used for convolution with a stride of 1 and a padding of 2, and the matrix m_4 is obtained after the ReLU activation function;
5) Similarly, m_4 is passed through the Pool2 layer; for example, max pooling is performed with a 3 × 3 filter and a stride of 2 to obtain the matrix m_5;
6) Similarly, m_5 is passed through the Conv3 layer; for example, 32 filters of 3 × 3 are used for convolution with a stride of 1 and a padding of 1, and the matrix m_6 is obtained after the ReLU activation function;
7) Similarly, m_6 is passed through the Pool3 layer; for example, max pooling is performed with a 3 × 3 filter and a stride of 2 to obtain the matrix m_7;
8) m_7 is passed through the AdaptiveAvgPool layer; for example, the output size is set to 3 × 3 to obtain the matrix m_8;
9) m_8 is flattened into a one-dimensional matrix m_9, which is passed through the fully connected layer FC4 to output a one-dimensional matrix m_10 of r × 1 (e.g., 128 × 1);
10) m_10 is passed through the Drop4 layer, which discards a proportion of the neural nodes with a certain drop probability (e.g., 50%) to prevent overfitting, giving the matrix m_11;
11) m_11 is passed through the fully connected layer FC5 to output a one-dimensional matrix m_12 of r × 1 (e.g., 32 × 1);
12) m_12 is passed through the Drop5 layer, which discards a proportion of the neural nodes with a certain drop probability (e.g., 50%) to prevent overfitting, giving the matrix m_13;
13) m_13 is passed through the fully connected layer FC6 to output a 2 × 1 matrix, whose two values correspond to the predicted values of the probabilities of the risky and non-risky categories; the predicted values are then processed by Softmax so that the probabilities of the two categories sum to 1. The overall network structure is shown below:
 TABLE 1 Multi-modal network Structure Table
| Layer | Input | Output |
| Conv1 | 224×224×9 | 55×55×16 | 
| Pool1 | 55×55×16 | 27×27×16 | 
| Conv2 | 27×27×16 | 27×27×32 | 
| Pool2 | 27×27×32 | 13×13×32 | 
| Conv3 | 13×13×32 | 13×13×32 | 
| Pool3 | 13×13×32 | 6×6×32 | 
| AdaptiveAvgPool | 6×6×32 | 3×3×32 | 
| FC4 | 3×3×32 | 128×1 | 
| Drop4 | 128×1 | 128×1 | 
| FC5 | 128×1 | 32×1 | 
| Drop5 | 32×1 | 32×1 | 
| FC6 | 32×1 | 2×1 |
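The layer stack of Table 1 can be sketched in PyTorch as follows (the attention modules described next are omitted here for brevity); the padding values and the ReLU after the fully connected layers are assumptions chosen so that the feature-map sizes match the table.

```python
import torch.nn as nn

class LightweightCNN(nn.Module):
    """Assumed realisation of Table 1: nine-channel motion profiles -> 2 class logits."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(9, 16, kernel_size=11, stride=4, padding=2), nn.ReLU(),  # Conv1: 224x224x9 -> 55x55x16
            nn.MaxPool2d(kernel_size=3, stride=2),                             # Pool1: -> 27x27x16
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),  # Conv2: -> 27x27x32
            nn.MaxPool2d(kernel_size=3, stride=2),                             # Pool2: -> 13x13x32
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # Conv3: -> 13x13x32
            nn.MaxPool2d(kernel_size=3, stride=2),                             # Pool3: -> 6x6x32
            nn.AdaptiveAvgPool2d(3),                                           # -> 3x3x32
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 3 * 32, 128), nn.ReLU(), nn.Dropout(0.5),            # FC4 + Drop4
            nn.Linear(128, 32), nn.ReLU(), nn.Dropout(0.5),                    # FC5 + Drop5
            nn.Linear(32, num_classes),                                        # FC6: 2x1 output
        )

    def forward(self, x):                                                      # x: (batch, 9, 224, 224)
        return self.classifier(self.features(x))
```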
Finally, an attention mechanism module is introduced after each convolution layer; channel attention and spatial attention transformations are applied to the feature map, which is then reconstructed into a new feature map. The calculation formulas of the channel attention and the spatial attention are respectively:
Attention_c = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
Attention_s = σ(Conv([AvgPool(F), MaxPool(F)]))
wherein Attention_c and Attention_s denote the results of channel attention and spatial attention respectively, F denotes the feature map output by a given convolution layer, σ denotes the Sigmoid function, MLP denotes a multi-layer perceptron network, and Conv denotes a convolution layer whose number of output channels is 1.
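A minimal PyTorch sketch of these two attention transformations and of the feature-map reconstruction is given below; the reduction ratio of the MLP and the 7 × 7 kernel of the spatial convolution are illustrative choices not fixed by the description above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):                 # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f):                                      # f: (B, C, H, W)
        avg = self.mlp(f.mean(dim=(2, 3)))                     # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))                      # MLP(MaxPool(F))
        return torch.sigmoid(avg + mx)[:, :, None, None]       # Attention_c: (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):                         # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)                      # AvgPool(F) over channels
        mx = f.amax(dim=1, keepdim=True)                       # MaxPool(F) over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Attention_s: (B, 1, H, W)

class AttentionBlock(nn.Module):
    """Reconstructs the feature map: F' = F * Attention_c, F'' = F' * Attention_s."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, f):
        f = f * self.ca(f)
        return f * self.sa(f)
```

In a full model, one AttentionBlock would be placed after each of the Conv1, Conv2 and Conv3 layers of the sketch above.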
3. Data enhancement of motion profile
The training set of the driving risk assessment model comprises a CNN network training set used for the input side. Its acquisition method is basically the same as the motion profile generation algorithm of the first part and comprises the following steps:
The picture of the historical driving video is divided into three vertically stacked driving areas; the image within each driving area of each video frame is averaged along the vertical direction and converted into a single row of pixels; the rows of pixels of successive frames are then spliced together in time order to form a motion profile, and all motion profiles form the CNN network training set.
In order to improve the generalization capability of the model, the CNN network training set is further expanded through data enhancement processing. The enhancement processing includes random transformations of brightness, contrast, saturation and hue, and flipping the motion profile horizontally with a certain probability. The effect after the data enhancement processing is shown in fig. 4.
Brightness transformation: the brightness of the motion profile is changed randomly. Let the original picture be im_1 and the brightness transformation factor be factor_b; the transformed image im_2 is:
im_2 = factor_b × im_1.
Saturation transformation: the saturation of the motion profile is changed randomly. Let the saturation transformation factor be factor_s and the grayscale image corresponding to im_2 be gray_2; the transformed image im_3 is:
im_3 = factor_s × im_2 + (1 − factor_s) × gray_2.
Contrast transformation: the contrast of the motion profile is changed randomly. Let the contrast transformation factor be factor_c; im_3 is converted into the corresponding grayscale image, whose pixel mean value is denoted mean; the transformed image im_4 is:
im_4 = factor_c × im_3 + (1 − factor_c) × mean.
Hue transformation: the hue of the motion profile is changed randomly. im_4 is converted into HSV format to obtain the hue channel H, the hue is shifted randomly, and the new HSV image is converted back into the original format to obtain im_5:
H_new = H_origin + factor_h × 255.
Flip transformation: im_5 is flipped horizontally with a certain probability to obtain im_6.
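The enhancement chain above can be sketched with NumPy and OpenCV as follows; the factor ranges and the 0.5 flip probability are illustrative assumptions, and because OpenCV stores 8-bit hue in the range 0 to 179, the hue shift is applied modulo 180 rather than on the 0 to 255 scale of the formula.

```python
import cv2
import numpy as np

def augment(profile, rng=np.random.default_rng()):
    im = profile.astype(np.float32)
    # brightness: im_2 = factor_b * im_1
    im = np.clip(rng.uniform(0.7, 1.3) * im, 0, 255)
    # saturation: im_3 = factor_s * im_2 + (1 - factor_s) * gray_2
    gray = cv2.cvtColor(im.astype(np.uint8), cv2.COLOR_BGR2GRAY)[..., None].astype(np.float32)
    f_s = rng.uniform(0.7, 1.3)
    im = np.clip(f_s * im + (1 - f_s) * gray, 0, 255)
    # contrast: im_4 = factor_c * im_3 + (1 - factor_c) * mean
    gray = cv2.cvtColor(im.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    f_c = rng.uniform(0.7, 1.3)
    im = np.clip(f_c * im + (1 - f_c) * gray.mean(), 0, 255)
    # hue: shift the H channel of the HSV representation
    hsv = cv2.cvtColor(im.astype(np.uint8), cv2.COLOR_BGR2HSV)
    hsv[..., 0] = (hsv[..., 0].astype(int) + rng.integers(-10, 10)) % 180
    im = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
    # horizontal flip with probability 0.5
    if rng.random() < 0.5:
        im = im[:, ::-1, :]
    return np.ascontiguousarray(im)
```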
4. Calibrating dangerous scenes based on an acceleration threshold with auxiliary video verification
The training set of the driving risk assessment model comprises a normal event set and a high-risk event set used for the output side. The acquisition method is as follows: abnormal values in the historical vehicle-mounted data collected by radar are detected and filtered using the 3σ principle of the normal distribution, and missing values are filled by linear interpolation; the acceleration distribution is obtained from the filled historical vehicle-mounted data, and the acceleration threshold dividing dangerous driving events is determined so as to judge obvious avoidance behaviour; potential dangerous driving events are judged and extracted on the basis of this threshold, and the normal event set and the high-risk event set are calibrated from the potential dangerous driving events through video verification. The detailed flow is as follows:
 1) Collecting historical vehicle-mounted data;
2) Most of the vehicle kinematic characteristic variables in the vehicle-mounted data follow a normal distribution, so the 3σ principle is used for outlier filtering: each non-empty kinematic characteristic variable of a driving record is subjected to the following condition judgment, and a value satisfying the condition is regarded as an abnormal value and treated as a missing value:
|x − μ| > 3σ
where x is the kinematic characteristic variable, μ is the mean value of x, and σ is the standard deviation of x.
3) Because the driving environment is complex and the many interference sources can affect the signal strength of the detection equipment, the driving data often contain missing values, which therefore need to be filled. The missing values are filled by linear interpolation, whose calculation formula is:
d_i = d_{i−1} + (d_{i+1} − d_{i−1}) × (t_i − t_{i−1}) / (t_{i+1} − t_{i−1})
wherein d_i is the missing value, d_{i−1} is the last non-empty nearest neighbour of the missing value, d_{i+1} is the next non-empty nearest neighbour of the missing value, n is the total number of records, and t_{i−1}, t_i, t_{i+1} are the time instants corresponding to d_{i−1}, d_i, d_{i+1}.
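A pandas sketch of steps 2) and 3) is given below; the column names are hypothetical, and pandas' index-based interpolation reproduces the linear interpolation formula when the index holds the time stamps t_i.

```python
import numpy as np
import pandas as pd

def clean_kinematics(df, columns=("speed", "acceleration")):
    """df: on-board records indexed by the time stamp of each record (hypothetical columns)."""
    df = df.copy()
    for col in columns:
        mu, sigma = df[col].mean(), df[col].std()
        # 3-sigma rule: treat |x - mu| > 3*sigma as a missing value
        df.loc[(df[col] - mu).abs() > 3 * sigma, col] = np.nan
        # linear interpolation between the nearest non-empty neighbours,
        # weighted by the time stamps in the index
        df[col] = df[col].interpolate(method="index")
    return df
```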
4) The vehicle acceleration data a are extracted from the natural driving data, and their distribution curve is plotted and examined; the acceleration threshold for obvious deceleration behaviour is determined and denoted TH_d.
5) All vehicle acceleration data are traversed in time order, and the emergency braking moments t_d satisfying the acceleration condition a ≤ TH_d are collected. For each moment t_d, the time segment from d_1 seconds before to d_2 seconds after it is taken to form a potential high-risk event segment e_c; false alarms caused by data collection errors are eliminated through video verification, and the n_conflict_candidate remaining high-risk event segments form the high-risk event set. To avoid event overlap, adjacent emergency braking moments must satisfy the condition t_d[i+1] − t_d[i] ≥ |d_1 + d_2|.
6) n_normal_candidate normal non-conflict events are randomly sampled from the remaining vehicle acceleration data, using |d_1 + d_2| as the time window, to form the normal event set.
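Steps 4) to 6) can be sketched as follows; TH_d, d_1, d_2, the number of normal candidates and the random seed are illustrative values to be chosen per data set, and both event sets would still be confirmed against the video afterwards.

```python
import numpy as np

def extract_events(t, a, th_d=-3.0, d1=3.0, d2=2.0, n_normal=100, seed=0):
    """t: time stamps in seconds, a: longitudinal acceleration in m/s^2 (aligned arrays)."""
    t, a = np.asarray(t, dtype=float), np.asarray(a, dtype=float)
    braking = t[a <= th_d]                              # candidate emergency-braking moments t_d
    keep = []
    for td in braking:                                  # enforce t_d[i+1] - t_d[i] >= d1 + d2
        if not keep or td - keep[-1] >= d1 + d2:
            keep.append(td)
    high_risk = [(td - d1, td + d2) for td in keep]     # potential high-risk segments e_c
    # candidate normal (non-conflict) events: random windows of length |d1 + d2|
    # that do not overlap any high-risk segment
    rng = np.random.default_rng(seed)
    starts = rng.uniform(t[0], t[-1] - (d1 + d2), size=n_normal)
    normal = [(s, s + d1 + d2) for s in starts
              if all(s + d1 + d2 <= lo or s >= hi for lo, hi in high_risk)]
    return high_risk, normal
```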
5. Constructing the overall driving risk assessment model
A conventional LSTM is used to extract the kinematic features in the driving risk assessment model, and these are fused with the visual features extracted by the lightweight CNN network, further improving the identification accuracy of the model.
1) The vehicle-mounted data from the radar are extracted at certain time intervals, and the LSTM network is used to extract the kinematic features; the output of the LSTM network is denoted f_kinematic.
2) The potential high-risk events are processed according to the motion profile generation algorithm of the first part to obtain the corresponding motion profiles; the profiles are input into the lightweight CNN network to extract the visual features, and the output of the network is denoted f_vision.
3) f_vision is concatenated with f_kinematic, i.e. [f_vision, f_kinematic] is used as the input to the fully connected layer, which outputs a 2 × 1 matrix whose two values correspond to the predicted values of the probabilities of the risky and non-risky classes; the predicted values are then processed with Softmax so that the probabilities of the two classes sum to 1.
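A minimal sketch of the fusion in steps 1) to 3), reusing the LightweightCNN sketch from the second part, is given below; the number of kinematic variables, the LSTM hidden size and the use of the CNN's 2-dimensional output as f_vision are assumptions.

```python
import torch
import torch.nn as nn

class VKNet(nn.Module):
    def __init__(self, kin_dim=6, lstm_hidden=32, vision_dim=2):
        super().__init__()
        self.cnn = LightweightCNN(num_classes=vision_dim)         # visual branch (second part)
        self.lstm = nn.LSTM(kin_dim, lstm_hidden, batch_first=True)
        self.fusion = nn.Linear(vision_dim + lstm_hidden, 2)      # fully connected prediction layer

    def forward(self, profiles, kinematics):
        # profiles: (B, 9, 224, 224), kinematics: (B, T, kin_dim)
        f_vision = self.cnn(profiles)
        _, (h_n, _) = self.lstm(kinematics)
        f_kinematic = h_n[-1]                                     # last hidden state: (B, lstm_hidden)
        logits = self.fusion(torch.cat([f_vision, f_kinematic], dim=1))
        return torch.softmax(logits, dim=1)                       # class probabilities summing to 1
```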
4) The normal event set and the high-risk event set from the fourth part are each divided into a training set Θ_train and a test set Θ_test at a ratio of 3:1.
5) The model is trained; during training, data enhancement is applied to the motion profiles according to the third part, training is stopped once the loss value of the model has converged to a small value after n_epoch epochs, and the final model M_VK is saved.
6) The trained M_VK model is called for each event in the test set Θ_test (comprising normal events and high-risk events) to obtain its predicted classification; the numbers of events predicted by the model as normal and as conflict are counted, and a confusion matrix is generated from the prediction results on the test set as follows:
 TABLE 2 confusion matrix
The sensitivity I_sensitivity and the specificity I_specificity of the model are calculated with the following formulas:
I_sensitivity = TP / (TP + FN)
I_specificity = TN / (FP + TN)
The ROC curve is generated from I_sensitivity and I_specificity to evaluate the prediction performance of the model.
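An illustrative sketch of this evaluation step is given below; y_true and y_score are hypothetical arrays holding the test-set labels and the model's predicted probability for the risky class.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

def evaluate(y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)          # I_sensitivity = TP / (TP + FN)
    specificity = tn / (fp + tn)          # I_specificity = TN / (FP + TN)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return sensitivity, specificity, auc(fpr, tpr)
```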
As shown in fig. 5, which presents the performance of the driving risk assessment model of the present invention, the larger the AUC value the better the performance; the driving risk assessment model (VK-Net) of the present invention reaches an AUC of 0.95, showing very good accuracy and precision.
6. Identifying dangerous scenes based on the driving risk assessment model.
Motion profiles are generated from the continuous current driving video by the motion profile generation algorithm; the motion profiles and the kinematic feature variables extracted from the vehicle-mounted data are taken as the inputs of the driving risk assessment model, which computes a predicted value indicating whether the driving is at risk, and an alarm is raised if it is.
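A hypothetical end-to-end usage of the sketches above on a current driving clip is shown below; the clip name, the region boundaries and the alarm threshold are illustrative, and the kinematic sequence is a placeholder for the variables extracted from the vehicle-mounted data.

```python
import cv2
import numpy as np
import torch

REGIONS = [(480, 720), (300, 480), (200, 300)]              # (y_l, y_u) per driving area, assumed
profiles = [motion_profile("current_clip.mp4", yl, yu) for yl, yu in REGIONS]
resized = [cv2.resize(p, (224, 224)) for p in profiles]     # (224, 224, 3) each
x_vision = torch.from_numpy(np.concatenate(resized, axis=2))          # (224, 224, 9)
x_vision = x_vision.permute(2, 0, 1).float().unsqueeze(0) / 255.0     # (1, 9, 224, 224)
x_kin = torch.randn(1, 50, 6)                               # placeholder kinematic sequence

model = VKNet()                                             # in practice, the trained M_VK weights
with torch.no_grad():
    p_risky = model(x_vision, x_kin)[0, 1].item()
if p_risky > 0.5:
    print("dangerous scene detected - raise alarm")
```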
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.