Disclosure of Invention
The invention mainly aims to provide a vehicle-mounted network anomaly detection method and system based on correlation analysis.
The invention adopts the following technical scheme:
In one aspect, the invention discloses a vehicle-mounted network anomaly detection method based on correlation analysis, which comprises the following steps:
collecting communication data while the vehicle is running; the communication data comprises a message ID, message content and message occurrence time;
predicting, with an established prediction model, the message value at the corresponding byte position, judging whether the deviation between the predicted message value and the actual message value exceeds a detection threshold, and if so, judging that the message is abnormal; the prediction model is established per group, based on the correlations between communication data during vehicle operation; the input of the prediction model is the message values at the byte positions of one or more message IDs in a correlation group, and the output of the prediction model is the message values at the byte positions of the other message IDs in that correlation group.
Preferably, the method for establishing the prediction model includes:
collecting communication data of vehicles of the same type in the running process;
calculating Hamming distances, analyzing the Hamming distance data, and removing message IDs whose message content never changes as well as, within the remaining message IDs, bytes whose content never changes; recording each message ID whose content changes, together with the positions of its changing bytes;
for the recorded message IDs, normalizing the message occurrence times, pairing the events of different message IDs by temporal proximity, calculating the correlation coefficient of each byte pair, extracting the byte pairs whose correlation coefficient has an absolute value greater than a preset value, and marking them as a correlation group; temporal proximity means the same time or times within a preset range of each other;
and training an LSTM neural network on the time-ordered message data of each correlation group, thereby establishing a prediction model for each correlation group.
Preferably, calculating Hamming distances, analyzing the Hamming distance data, removing unchanging message IDs and unchanging bytes, and recording the changing message IDs and byte positions specifically comprises the following steps:
for each message ID, computing the sum of the Hamming distances over all of its bytes, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of every byte of that message ID is unchanged and the message ID is removed;
for each message ID that is not removed, computing the Hamming distance of each byte by byte position, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of that byte is unchanged and the byte is removed;
and recording each message ID whose content changes, together with the positions of its changing bytes.
Preferably, the preset value of the correlation coefficient is 0.5.
Preferably, the method for acquiring and setting the detection threshold includes:
selecting communication data collected from multiple segments of normal driving records, predicting the message value of a byte in the corresponding message ID with the prediction model, and setting the detection threshold based on the standard deviation between the predicted message value and the actual message value.
In another aspect, the invention provides a vehicle-mounted network anomaly detection system based on correlation analysis, which comprises the following components:
a data acquisition module, used for collecting communication data while the vehicle is running; the communication data comprises a message ID, message content and message occurrence time;
a message anomaly detection module, which predicts the message value at the corresponding byte position with the prediction model established by a prediction model establishment module, judges whether the deviation between the predicted message value and the actual message value exceeds a detection threshold, and judges that the message is abnormal if it does; the prediction model is established per group, based on the correlations between communication data during vehicle operation; the input of the prediction model is the message values at the byte positions of one or more message IDs in a correlation group, and the output of the prediction model is the message values at the byte positions of the other message IDs in that correlation group.
Preferably, the method for establishing the prediction model includes:
collecting communication data of vehicles of the same type in the running process;
calculating Hamming distances, analyzing the Hamming distance data, and removing message IDs whose message content never changes as well as, within the remaining message IDs, bytes whose content never changes; recording each message ID whose content changes, together with the positions of its changing bytes;
for the recorded message IDs, normalizing the message occurrence times, pairing the events of different message IDs by temporal proximity, calculating the correlation coefficient of each byte pair, extracting the byte pairs whose correlation coefficient has an absolute value greater than a preset value, and marking them as a correlation group; temporal proximity means the same time or times within a preset range of each other;
and training an LSTM neural network on the time-ordered message data of each correlation group, thereby establishing a prediction model for each correlation group.
Preferably, calculating Hamming distances, analyzing the Hamming distance data, removing unchanging message IDs and unchanging bytes, and recording the changing message IDs and byte positions specifically comprises the following steps:
for each message ID, computing the sum of the Hamming distances over all of its bytes, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of every byte of that message ID is unchanged and the message ID is removed;
for each message ID that is not removed, computing the Hamming distance of each byte by byte position, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of that byte is unchanged and the byte is removed;
and recording each message ID whose content changes, together with the positions of its changing bytes.
Preferably, the preset value of the correlation coefficient is 0.5.
Preferably, the method for acquiring and setting the detection threshold includes:
selecting communication data collected from multiple segments of normal driving records, predicting the message value of a byte in the corresponding message ID with the prediction model, and setting the detection threshold based on the standard deviation between the predicted message value and the actual message value.
Compared with the prior art, the invention has the following beneficial effects:
the method and system require neither the vehicle's specific bus communication protocol nor knowledge of where and how physical variables are stored; without converting bus communication data into variables with physical meaning, the correlations among raw message data are determined through statistical analysis, a prediction model of message byte content is established with a neural network, and malicious data injection attacks that are inconsistent with the vehicle's normal driving state can be detected in real time.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Referring to fig. 1, in one aspect, the present invention provides a method for detecting an abnormality of a vehicle-mounted network based on correlation analysis, including:
S10, collecting communication data while the vehicle is running; the communication data comprises a message ID, message content and message occurrence time;
S20, predicting, with the established prediction model, the message value at the corresponding byte position, judging whether the deviation between the predicted message value and the actual message value exceeds a detection threshold, and if so, judging that the message is abnormal; the prediction model is established per group, based on the correlations between communication data during vehicle operation; the input of the prediction model is the message values at the byte positions of one or more message IDs in a correlation group, and the output of the prediction model is the message values at the byte positions of the other message IDs in that correlation group.
Referring to fig. 2, the method for establishing the prediction model includes:
S201, collecting communication data from vehicles of the same vehicle type while they are running.
S202, calculating Hamming distances, analyzing the Hamming distance data, and removing message IDs whose message content never changes as well as, within the remaining message IDs, bytes whose content never changes; and recording each message ID whose content changes, together with the positions of its changing bytes.
Specifically, the method comprises the following steps:
S2021, for each message ID, computing the sum of the Hamming distances over all of its bytes, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of every byte of that message ID is unchanged and the message ID is removed;
For example, suppose the message ID is 0xYFD0500 and the sampling result over a period of time is as follows: each message with this ID always contains 8 bytes, and there are several hundred messages in total. For every two adjacent messages, the Hamming distance of each of the 8 bytes is calculated by byte position, and these distances are summed. If two adjacent messages are both 0x0102030405060708, the summed Hamming distance is 0; if all 289 messages are identical, the maximum, minimum, median, lower quartile and upper quartile are all 0, and the message ID can be removed directly.
S2022, for each message ID that is not removed, computing the Hamming distance of each byte by byte position, and calculating index values including the maximum, minimum, median, lower quartile and upper quartile; if all index values are 0 or equal, the content of that byte is unchanged and the byte is removed;
For example, the message ID YYF00F51 has several hundred sampled records over a period of time, and the statistical results are shown in Table 1 below (only the 8 bytes are listed), where only the 4th byte meets the recording requirement.
TABLE 1
S2023, recording each message ID whose content changes, together with the positions of its changing bytes.
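Steps S2021 to S2023 can be illustrated with a minimal Python sketch. It assumes that each message ID maps to a time-ordered list of payloads, and it reads the rule "all index values are 0 or equal" as "the five index values are identical". The names `changing_bytes`, `five_number_summary` and `is_static`, as well as the inclusive quartile method, are illustrative choices and are not taken from the invention.

```python
from statistics import median, quantiles

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two byte values."""
    return bin(a ^ b).count("1")

def five_number_summary(values):
    """Maximum, minimum, median, lower quartile and upper quartile."""
    if len(values) < 2:
        v = values[0] if values else 0
        return (v, v, v, v, v)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (max(values), min(values), median(values), q1, q3)

def is_static(distances):
    """True when all index values are 0 or equal (assumed reading of the rule)."""
    return len(set(five_number_summary(distances))) == 1

def changing_bytes(messages):
    """messages: dict of message ID -> list of payloads (bytes) in time order.
    Returns dict of message ID -> 1-based positions of bytes that change."""
    kept = {}
    for msg_id, payloads in messages.items():
        pairs = list(zip(payloads, payloads[1:]))
        # S2021: sum of per-byte Hamming distances for each adjacent pair
        totals = [sum(hamming(x, y) for x, y in zip(p, q)) for p, q in pairs]
        if not totals or is_static(totals):
            continue  # message content never changes: drop the whole ID
        # S2022: repeat the check for each byte position separately
        positions = []
        for k in range(min(len(p) for p in payloads)):
            per_byte = [hamming(p[k], q[k]) for p, q in pairs]
            if not is_static(per_byte):
                positions.append(k + 1)
        # S2023: record the ID together with its changing byte positions
        if positions:
            kept[msg_id] = positions
    return kept

# Hypothetical sample data: one constant ID and one ID whose 4th byte varies.
sample = {
    "ID_STATIC": [bytes.fromhex("0102030405060708")] * 5,
    "ID_DYNAMIC": [bytes([0, 0, 0, v, 0, 0, 0, 0]) for v in (1, 2, 4, 9, 3)],
}
print(changing_bytes(sample))
```

In this sample the constant ID is dropped in S2021 and only the 4th byte of the other ID survives S2022.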
S203, for the recorded message IDs, normalizing the message occurrence times, pairing the events of different message IDs by temporal proximity, calculating the correlation coefficient of each byte pair, extracting the byte pairs whose correlation coefficient has an absolute value greater than a preset value, and marking them as a correlation group; temporal proximity means the same time or times within a preset range of each other.
Specifically, for a selected group, a visualization system can be used to draw a line graph and a time-series scatter diagram of the actual byte values of the grouped messages over the same time period, and the group is verified and rechecked against the graphs. If the byte values in the group change with consistent or exactly opposite trends and the correlation remains stable over time, the group has a strong correlation and is confirmed and marked. If pairwise correlation exists among multiple correlation groups, for example correlations AB, BC and AC, then A, B and C are merged into one correlation group. Alternatively, if the groups only partially intersect, such as AB and AC, they can also be merged into one group, but a special mark is required: in subsequent training, only B and C can be selected as input items, with A as the output item.
Referring to fig. 3, the calculated correlation coefficients between the 6th and 8th bytes of message ID XXFEYYEE and the 3rd byte of message ID XXF003YY are 0.95 and 0.96, respectively. A time-series scatter diagram of the message values over a certain sampling period is drawn with the visualization system, where fig. 3(a) shows the message values of XXFEYYEE and fig. 3(b) shows the message values of XXF003YY. The figure shows that the three trends are highly consistent, which verifies the calculated results and confirms that the three bytes of the two messages are pairwise correlated. They can form one correlation group in which any two bytes can be selected as input items and the remaining one as the output item.
Referring to fig. 4, the calculated correlation coefficients between the 2nd and 3rd bytes of message ID XXYYF030 and the 6th byte of message ID XXFEYY02 are 0.7 and 0.6, respectively. A time-series scatter diagram of the message values over a certain sampling period is drawn with the visualization system, where fig. 4(a) shows the message values of XXYYF030 and fig. 4(b) shows the message values of XXFEYY02. The three trends are broadly consistent, and it is judged that the 2nd and 3rd bytes of XXYYF030, combined by some calculation rule, match the change of the 6th byte of XXFEYY02 more closely; therefore only the 2nd and 3rd bytes of XXYYF030 can be selected as input items, with the 6th byte of XXFEYY02 as the output item.
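The merging rule for correlation groups can be sketched as follows. This is a hedged illustration only: it assumes that the correlated byte pairs are already available as label pairs, and the function name `merge_groups` and the keys `fully_correlated` and `allowed_outputs` are hypothetical. A member is listed as an allowed output item only if it is correlated with every other member, which reproduces both cases described above (AB, BC, AC and AB, AC); other topologies are not addressed by the text and would yield no allowed output here.

```python
from itertools import combinations

def merge_groups(pairs):
    """pairs: iterable of (label, label) byte pairs that passed the threshold.
    Returns one dict per merged correlation group."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    groups, seen = [], set()
    for start in sorted(adj):
        if start in seen:
            continue
        # Collect every label reachable from `start` (one merged group).
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adj[node] - members)
        seen |= members
        # Fully correlated: every pair inside the group is itself correlated.
        full = all(b in adj[a] for a, b in combinations(members, 2))
        # Special mark for partial groups: only a member correlated with all
        # other members may be used as the output item.
        outputs = sorted(m for m in members if adj[m] >= members - {m})
        groups.append({
            "members": sorted(members),
            "fully_correlated": full,
            "allowed_outputs": outputs,
        })
    return groups

print(merge_groups([("A", "B"), ("B", "C"), ("A", "C")]))
print(merge_groups([("A", "B"), ("A", "C")]))
```

For the pairs AB and AC the group is marked as partially correlated and A is the only allowed output item, so B and C serve as inputs.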
Further, the preset value of the correlation coefficient is 0.5.
The correlation coefficient is calculated from the covariance and the standard deviations. For two variables x and y, the sample correlation coefficient is:

$$r_{xy} = \frac{S_{xy}}{S_x S_y}$$

where $r_{xy}$ denotes the sample correlation coefficient, $S_{xy}$ the sample covariance, $S_x$ the sample standard deviation of x, and $S_y$ the sample standard deviation of y. The covariance $S_{xy}$ and the standard deviations $S_x$ and $S_y$ are calculated as:

$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad S_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}$$

Here n is the number of paired samples and $\bar{x}$, $\bar{y}$ are the sample means. In this method, x represents the value of the k-th byte (k is generally 1 to 8) of the message with ID A, and y represents the value of the m-th byte of the message with ID B. For example, x represents the value of the 6th byte of message ID XXFEYYEE, and y represents the value of the 3rd byte of message ID XXF003YY.
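The pairing and correlation step of S203 can be sketched in Python as below. The sketch assumes that timestamps have already been normalized to a common origin, pairs each sample of x with the nearest sample of y within a hypothetical tolerance `max_gap`, and uses the sample covariance and sample standard deviations with an n-1 divisor. The function names and the sample data are illustrative only.

```python
from bisect import bisect_left
from math import sqrt

def pair_by_time(series_x, series_y, max_gap):
    """series_*: lists of (timestamp, byte_value) sorted by timestamp.
    Pairs each x sample with the nearest y sample no further than max_gap."""
    times_y = [t for t, _ in series_y]
    xs, ys = [], []
    for t, vx in series_x:
        i = bisect_left(times_y, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times_y)]
        if not candidates:
            continue
        j = min(candidates, key=lambda c: abs(times_y[c] - t))
        if abs(times_y[j] - t) <= max_gap:
            xs.append(vx)
            ys.append(series_y[j][1])
    return xs, ys

def pearson(xs, ys):
    """Sample correlation coefficient r_xy = S_xy / (S_x * S_y)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        return 0.0  # a constant byte carries no correlation information
    return s_xy / (s_x * s_y)

# Hypothetical byte series of two message IDs with slightly offset timestamps.
byte_x = [(0.00, 10), (0.10, 20), (0.20, 30), (0.30, 40)]
byte_y = [(0.01, 100), (0.11, 80), (0.21, 60), (0.31, 40)]
xs, ys = pair_by_time(byte_x, byte_y, max_gap=0.02)
r = pearson(xs, ys)
print(r, abs(r) > 0.5)  # the pair is kept when |r| exceeds the preset value
```

Because the test uses the absolute value of the coefficient, a byte pair that moves in exactly opposite directions, as in this sample, is retained in the same way as a positively correlated pair.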
S204, training the message data in each correlation grouping by using an LSTM neural network according to the time sequence, and establishing a prediction model of each correlation grouping.
Specifically, for a group consisting of a pair, either member is arbitrarily selected as the input item and the other as the output item. If a group contains more than two objects, one of them is selected as the output item and the others as input items. The selection of input and output items may be adjusted according to the training results. For example, if pairwise correlation exists among A_1 (the 1st byte of the message with ID A), B_2 (the 2nd byte of B) and C_5 (the 5th byte of C), any two of them, such as A_1 and B_2, can be selected as input items, with C_5 as the output item.
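As an illustration of how a correlation group might be turned into time-ordered training samples, the following sketch builds sliding windows over the input bytes and takes the output byte at the last step of each window as the target. The window length, the scaling of byte values by 255 and the name `make_windows` are assumptions made for this example. The LSTM network itself is not shown, since the text does not specify its architecture or training settings.

```python
def make_windows(input_series, output_series, window):
    """input_series: list of time-aligned byte-value lists (one per input item).
    output_series: byte values of the output item, same length.
    Returns (X, y): X[i] holds `window` time steps, each a list of scaled
    input values; y[i] is the scaled output value at the window's last step."""
    length = len(output_series)
    if any(len(s) != length for s in input_series):
        raise ValueError("all series must be time-aligned and equally long")
    X, y = [], []
    for end in range(window, length + 1):
        steps = [[s[t] / 255.0 for s in input_series]
                 for t in range(end - window, end)]
        X.append(steps)
        y.append(output_series[end - 1] / 255.0)
    return X, y

# Hypothetical aligned byte values for A_1, B_2 (inputs) and C_5 (output).
a_1 = [0, 51, 102, 153, 204]
b_2 = [255, 204, 153, 102, 51]
c_5 = [0, 51, 102, 153, 204]
X, y = make_windows([a_1, b_2], c_5, window=3)
print(len(X), len(X[0]), len(X[0][0]))  # samples, time steps, input items
```

The resulting shape (samples, time steps, input items) is the layout commonly expected by sequence models, so swapping the roles of input and output items only requires changing the arguments.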
Further, after the prediction model is built, a plurality of segments of CAN bus messages collected by normal driving records are selected to test the prediction model, the standard deviation between the prediction message value of a byte corresponding to the message ID and the original message value is calculated, and a proper detection threshold value is set according to the standard deviation and the normal data range of the corresponding message. Specifically, the detection threshold may be set to 2 times the standard deviation, and in practical application, the detection threshold may be adjusted according to the training data condition and the fluctuation range of the normal message value itself, so as to avoid false alarm.
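The threshold setting can be sketched as follows, under the assumption that "standard deviation between the predicted and original message values" means the sample standard deviation of the prediction errors on normal driving data. The function name `detection_threshold` and the sample values are hypothetical; the factor of 2 is the value given above.

```python
from statistics import stdev

def detection_threshold(predicted, actual, factor=2.0):
    """Threshold = factor * sample standard deviation of prediction errors."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return factor * stdev(errors)

# Hypothetical predictions and actual byte values from normal driving records.
predicted = [10, 12, 9, 11]
actual = [10, 10, 10, 10]
threshold = detection_threshold(predicted, actual)
print(round(threshold, 4))
```

In practice the factor would be tuned against the fluctuation range of the normal message values, as the text notes, to keep the false alarm rate acceptable.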
Further, based on the relevance grouping, a byte value corresponding to the output item of the model is calculated and predicted in real time by using a prediction model, and if the deviation of the data value of the predicted output item and the data value of the actually received message exceeds the detection threshold value obtained by training, the group of messages is considered to be abnormal, and the system is possibly subjected to malicious and illegal injection attacks. Continuing with the example in S204, the message sequences corresponding to A _1 and B _2 in a certain small time range are input during real-time detection, the predicted message value of C _5 in the corresponding time period is output, the error between the predicted value and the actual received value is calculated, and if the error is larger than the detection threshold value, abnormal behavior is prompted to be detected.
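The real-time check can be sketched as below. The trained prediction model is represented by an arbitrary callable; the stand-in `toy_predict`, which simply averages the inputs of the last time step, is a placeholder for demonstration and is not the LSTM model of the invention. The name `detect` and the sample values are likewise hypothetical.

```python
def detect(predict, input_window, actual_value, threshold):
    """Compares the predicted output byte with the received one.
    Returns (is_anomalous, predicted_value, deviation)."""
    predicted = predict(input_window)
    deviation = abs(predicted - actual_value)
    return deviation > threshold, predicted, deviation

def toy_predict(window):
    """Placeholder model: mean of the input values at the last time step."""
    last = window[-1]
    return sum(last) / len(last)

# Recent values of the input items (for example A_1 and B_2), oldest first.
window = [[100, 104], [102, 106]]
print(detect(toy_predict, window, actual_value=105, threshold=5))
print(detect(toy_predict, window, actual_value=180, threshold=5))
```

The first call stays within the threshold, while the second, where the received value is far from the prediction, is flagged as abnormal.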
The vehicle-mounted network anomaly detection method based on correlation analysis of the invention is used to detect abnormal messages on a vehicle CAN bus or other buses. It extracts raw message byte data directly and applies correlation analysis to obtain message combinations with strong correlations, then performs regression analysis on the grouped message data to establish correlation models of normal messages. The variables within each group model are positively or negatively correlated; they are the digital expression of the corresponding vehicle sensor states during driving, and can be used to detect in real time the data inconsistencies caused by malicious data injection attacks.
In another aspect, the invention provides a vehicle-mounted network anomaly detection system based on correlation analysis, which comprises the following components:
a data acquisition module, used for collecting communication data while the vehicle is running; the communication data comprises a message ID, message content and message occurrence time;
a message anomaly detection module, which predicts the message value at the corresponding byte position with the prediction model established by a prediction model establishment module, judges whether the deviation between the predicted message value and the actual message value exceeds a detection threshold, and judges that the message is abnormal if it does; the prediction model is established per group, based on the correlations between communication data during vehicle operation; the input of the prediction model is the message values at the byte positions of one or more message IDs in a correlation group, and the output of the prediction model is the message values at the byte positions of the other message IDs in that correlation group.
The specific implementation of each module of the vehicle-mounted network anomaly detection system based on correlation analysis is consistent with the vehicle-mounted network anomaly detection method based on correlation analysis described above, and is not repeated in this embodiment.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made using this design concept shall be deemed an act infringing the protection scope of the present invention.