Disclosure of Invention
To detect abnormal network communication behavior of application software, the invention provides a method and a system for detecting abnormal application software behavior based on network communication similarity, which obtain the network communication behavior of the application software from the characteristics of network data packets and calculate the similarity of that behavior.
To achieve the above object, the present invention provides the following technical solutions.
An application software abnormal behavior detection method based on network communication similarity, the method comprising:
Grouping the data packets according to the characteristics of the data packets, and constructing each group of data packets into a software behavior chain;
generating a feature vector of each software behavior chain describing statistical features of the behavior chain;
and obtaining an abnormal behavior detection result of the software behavior chain based on the similarity between the feature vector describing the statistical features of the behavior chain and the central vector of each cluster in a baseline model, wherein the baseline model is constructed based on normal application software communication flow data.
Further, the grouping of the data packets according to the data packet characteristics, and constructing each group of data packets as a software behavior chain, includes:
receiving data packets of network traffic by using a sliding window;
Extracting the data packet characteristics of each data packet, wherein the data packet characteristics comprise a protocol type, a departure time, an arrival time, a source address, a destination address, a source port, a destination port, a payload, a software signature, a data packet length, TCP handshake metadata, a total number of bytes, a maximum value/minimum value/average value/standard deviation of a packet length, a maximum value/minimum value/average value/standard deviation of a packet interval time and an information entropy;
grouping data packets according to the software signature, the protocol type, the source address, the source port, the destination address, the destination port and the payload;
and sequencing the grouped data packets according to the departure time to obtain a corresponding software behavior chain.
Further, the grouping of the data packets according to the software signature, the protocol type, the source address, the source port, the destination address, the destination port and the payload includes:
acquiring a data packet grouping strategy, wherein the data packet grouping strategy comprises:
data packets with the same software signature are divided into a group;
and/or,
for data packets with the protocol type of TCP, data packets from the same TCP connection are divided into a group;
and/or,
data packets with the same protocol type, the same source address, the same source port, the same destination address and the same destination port are divided into a group;
and/or,
data packets having the same specific file type or keyword in the payload are divided into a group;
and dividing data packets into a group in the case that the data packet characteristics of at least two data packets satisfy the grouping strategy.
Further, the generating the feature vector describing the statistical feature of the behavior chain of each software behavior chain includes:
acquiring the vector representation xi of each data packet in a software behavior chain;
obtaining a vector representation si corresponding to each vector representation xi through a self-attention calculation that analyzes the correlation among the features within the data packet;
combining the vector representations si of the software behavior chain, applying batch normalization to the combined result, and obtaining a vector representation of the combined result through a self-attention calculation that analyzes the correlation among the data packets;
and passing the vector representation of the combined result sequentially through a fully connected layer and a softmax layer to obtain the feature vector describing the statistical features of the behavior chain.
Further, the behavior chain statistical features include vector dimension, mean, standard deviation, median absolute deviation, skewness, kurtosis, and Mahalanobis distance.
Further, the baseline model is constructed based on normal application software communication traffic data, comprising:
constructing a plurality of normal software behavior chains based on normal application software communication flow data;
and generating, for each normal software behavior chain, a feature vector describing the statistical features of the behavior chain, and then clustering the feature vectors to obtain the baseline model.
Further, the generating the feature vector describing the statistical feature of the behavior chain of each normal software behavior chain and then clustering to obtain a baseline model includes:
Treating each feature vector describing the statistical features of the behavior chain as an independent cluster;
Generating a plurality of clusters Vi(0) by calculating the similarity between the feature vectors, wherein i is a positive integer;
Generating a center vector of the cluster Vi(t-1) based on the variance of the similarity between each feature vector and other feature vectors in the cluster Vi(t-1), wherein t is an iteration round;
generating a plurality of clusters Vi(t) according to the similarity between the center vector of each cluster Vi(t-1) and the center vectors of the other clusters Vj(t-1), setting t=t+1, and returning to the step of generating the center vector of the cluster Vi(t-1) based on the variance of the similarity between each feature vector and the other feature vectors in the cluster Vi(t-1), wherein j is a positive integer and j is not equal to i;
and obtaining the baseline model when the number of clusters no longer changes between iteration round t and iteration round t+1, or when the total number of clusters Vi(t) in iteration round t reaches the set value k, wherein k is the number of types of basic network behaviors.
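Further, the similarity between the feature vectors is
$$\mathrm{sim}(\beta_i,\beta_j) = w_r \cdot \frac{\mathrm{cov}(\beta_i,\beta_j)}{\sigma(\beta_i)\,\sigma(\beta_j)},$$
where $w_r$ is a learnable weight matrix, $\mathrm{cov}(\beta_i,\beta_j)$ is the covariance between feature vector $\beta_i$ and feature vector $\beta_j$, $\sigma(\beta_i)$ is the standard deviation of feature vector $\beta_i$, and $\sigma(\beta_j)$ is the standard deviation of feature vector $\beta_j$.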
Further, after obtaining the abnormal behavior detection result of the software behavior chain, the method further comprises the steps of:
manually reviewing the abnormal behavior detection result of the software behavior chain;
and, in the case that the abnormal behavior detection result of the software behavior chain is a misjudgment, recalculating the similarity of the misjudged behavior and adjusting the learnable weight matrix wr.
An application software abnormal behavior detection system based on network communication similarity, the system comprising:
The data packet grouping module is used for grouping the data packets according to the data packet characteristics and constructing each group of data packets into a software behavior chain;
The vector generation module is used for generating feature vectors of each software behavior chain, which describe statistical features of the behavior chains;
The anomaly detection module is used for obtaining an abnormal behavior detection result of the software behavior chain based on the similarity between the feature vector describing the statistical features of the behavior chain and the center vector of each cluster in a baseline model, wherein the baseline model is constructed based on normal application software communication traffic data.
Compared with the prior art, the invention combines a deep learning model with behavior analysis and provides a method for analyzing the similarity of network communication behavior using the deep learning model. The method and the system use a self-attention mechanism to capture long-distance dependencies in the data so as to represent the characteristics of a software behavior chain more accurately; extract baselines of similar behaviors with a clustering algorithm to obtain a baseline model of the behaviors; process and analyze real-time traffic against the behavior baselines output by the baseline model; and judge whether the communication behavior of the application software is abnormal by comparing the similarity between newly generated behavior and the normal behaviors in the model, thereby providing data support for detecting abnormal application software behavior.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The method for detecting abnormal application software behavior based on network communication similarity, as shown in Fig. 1, comprises the following steps.
Step 1: grouping the data packets according to the characteristics of the data packets, and constructing each group of data packets into a software behavior chain.
The present embodiment uses a 256-bit sliding window to receive network packets. When the sliding window is full, in order to acquire a series of network behaviors of the software, the data packets are grouped according to the characteristics of the data packets, and each group of data packets is constructed into a software behavior chain. In a behavior chain, each data packet represents a node of the chain.
Specifically, the invention analyzes the collected data packets one by one, extracts the protocol fields and the effective feature information, and classifies the data packets accordingly. The features used include, but are not limited to, protocol type (e.g., TCP, UDP, IP, DNS), departure time, arrival time, source IP, destination IP, source port, destination port, payload, software signature, packet length, TCP handshake metadata, total number of bytes, maximum/minimum/average/standard deviation of packet length, maximum/minimum/average/standard deviation of packet interval time, and information entropy. Based on the extracted features, the data packet is matched against predefined application signatures or rules to determine which application it belongs to.
The specific grouping modes are as follows and can be used in combination:
(1) Data packets with the same software signature are grouped together;
(2) For TCP, data packets from the same TCP connection are grouped together;
(3) Data packets with the same protocol, source address, source port, destination address and destination port are grouped together;
(4) Data packets whose payloads contain the same specific file type or keyword are grouped together.
The grouped data packets are sorted by departure time, and the features of each data packet are normalized and encoded to generate a representation vector for each data packet. A group of representation vectors forms a behavior chain, and each data packet represents a node of the behavior chain.
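The grouping and ordering in step 1 can be illustrated with a minimal Python sketch. It covers only grouping mode (3), keyed on protocol, source address, source port, destination address and destination port, followed by sorting on departure time. The `Packet` record, its field names and the sample addresses are hypothetical and are not part of the described embodiment.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical, simplified packet record; field names are illustrative only.
    protocol: str
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    departure_time: float
    length: int


def build_behavior_chains(packets):
    """Group packets by their five-tuple, then sort each group by departure time."""
    groups = defaultdict(list)
    for p in packets:
        key = (p.protocol, p.src_addr, p.src_port, p.dst_addr, p.dst_port)
        groups[key].append(p)
    # Each sorted group is one behavior chain; each packet is one node of the chain.
    return {
        key: sorted(group, key=lambda p: p.departure_time)
        for key, group in groups.items()
    }


packets = [
    Packet("TCP", "10.0.0.2", 50000, "93.184.216.34", 443, 2.0, 120),
    Packet("UDP", "10.0.0.2", 53000, "8.8.8.8", 53, 1.5, 60),
    Packet("TCP", "10.0.0.2", 50000, "93.184.216.34", 443, 1.0, 74),
]
chains = build_behavior_chains(packets)
```

The other grouping modes (software signature, TCP connection, payload file type or keyword) would only change how `key` is built.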
Step 2: generating, for each software behavior chain, a feature vector describing the statistical features of the behavior chain.
The present invention uses a self-attention mechanism to calculate the contextual relationship between the nodes of a behavior chain. From the self-attention output, the vector dimension is counted and statistical features such as the mean and standard deviation are calculated to obtain a feature vector describing the statistical features of the behavior chain.
In one embodiment, each node of a behavior chain first passes through a self-attention component that analyzes the correlation among the node's internal features and outputs a segment vector. The segment vectors are then combined and, after batch normalization, fed into a second self-attention layer that analyzes the correlation among the nodes of the whole behavior chain. The result passes through batch normalization again and enters a fully connected layer that uses ReLU as its activation function, and the final output is obtained through a softmax layer. The specific flow is shown in Fig. 2.
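The self-attention calculation formula is as follows.
Let the input vector be $X = (x_1, x_2, \ldots, x_n)$ and the output vector be $Z = (z_1, z_2, \ldots, z_n)$; then:
$$Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V$$
$$Z = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
where $W_Q$, $W_K$ and $W_V$ are learnable weight matrices and $d$ is the feature dimension.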
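As an illustration, the following NumPy sketch computes self-attention in the standard scaled dot-product form, which is assumed here to be the form intended by the description. The shapes, the random weights and the function names are illustrative; in the described system the weight matrices would be learned rather than sampled.

```python
import numpy as np


def softmax(a, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over the rows of X (one row per node)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = K.shape[-1]  # feature dimension used for scaling
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_in, d = 5, 8, 4  # 5 nodes, 8 input features, 4 attention features (assumed sizes)
X = rng.normal(size=(n, d_in))
W_Q, W_K, W_V = (rng.normal(size=(d_in, d)) for _ in range(3))
Z, A = self_attention(X, W_Q, W_K, W_V)
```

Each row of `A` holds the attention weights that one node assigns to every node in the chain, so each row sums to 1.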
Thus, for the output vector α = (s1, s2, ..., sn) of the self-attention component, a feature vector β = (y1, y2, ..., yn) is calculated, whose components have the following meaning:
β = (vector dimension, mean, standard deviation, median absolute deviation, skewness, kurtosis, Mahalanobis distance), where the median absolute deviation may represent the deviation of the behavioral features of the behavior chain from their central location, the skewness represents whether the behavior of the behavior chain leans left or right, the kurtosis represents how peaked the behavior of the behavior chain is, and the Mahalanobis distance represents the similarity between a node and the mean.
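The calculation formulas are as follows:
Vector dimension: $n$
Mean: $\mu = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Standard deviation: $\sigma = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$
Median: $m = \mathrm{median}(\alpha)$
Median absolute deviation: $\mathrm{MAD} = \mathrm{median}\left(|x_i - m|\right)$
Skewness: $\dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{x_i - \mu}{\sigma}\right)^{3}$
Kurtosis: $\dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{x_i - \mu}{\sigma}\right)^{4}$
Mahalanobis distance: $D_M = \sqrt{\gamma\, S^{-1}\gamma^{T}}$, where $S$ is the covariance matrix and $\gamma = (x_1 - \mu, \ldots, x_n - \mu)$.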
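The statistical features can be computed as in the following NumPy sketch. It assumes population (1/n) normalization for the standard deviation, skewness and kurtosis, and the usual median-based definition of the median absolute deviation; the description does not fix these conventions. The Mahalanobis distance is shown as a separate helper because it needs a covariance matrix `S`, which is supplied by the caller here as an assumption.

```python
import numpy as np


def chain_statistics(alpha):
    """Return (dimension, mean, std, MAD, skewness, kurtosis) of a 1-D vector."""
    x = np.asarray(alpha, dtype=float)
    n = x.size
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0), an assumption
    m = np.median(x)
    mad = np.median(np.abs(x - m))
    if sigma == 0:
        # A constant vector has no spread; skewness and kurtosis are set to 0 here.
        skew, kurt = 0.0, 0.0
    else:
        z = (x - mu) / sigma
        skew, kurt = np.mean(z ** 3), np.mean(z ** 4)
    return np.array([n, mu, sigma, mad, skew, kurt])


def mahalanobis(x, mu, S):
    """Mahalanobis distance of x from mu under covariance matrix S."""
    gamma = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve S y = gamma instead of forming the explicit inverse of S.
    return float(np.sqrt(gamma @ np.linalg.solve(S, gamma)))


stats = chain_statistics([1, 2, 3, 4, 10])
dist = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check of the helper.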
Step 3: obtaining an abnormal behavior detection result of the software behavior chain based on the similarity between the feature vector describing the statistical features of the behavior chain and the center vector of each cluster in the baseline model.
The baseline model of the present invention is constructed based on normal application software traffic data. The data collected for the training set are all behavior data of normal network communication of the software and contain no abnormal data, so training yields a baseline of the normal behavior of the software. For the data in the training set, invalid traffic such as duplicate packets, TCP retransmissions and HTTP handshake data packets needs to be removed, and repeated traffic is merged. The preprocessed traffic data are then used as the subsequent input data.
After the data have been fully processed, the feature vectors are clustered using a clustering algorithm. In the clustering process, the similarity between the feature vectors is used as the classification criterion. The similarity distance between each feature vector and the center vector of its cluster is recorded, and the maximum similarity distance is selected as the baseline standard for judging abnormal behavior. The trained model becomes the baseline model used by the invention to detect abnormal behavior.
In this embodiment, the feature vectors are classified according to a clustering algorithm to construct a baseline model. For the feature vectors β1, β2, ..., βm, a hierarchical clustering method is adopted; the number of clusters after classification is set to k, and the value of k is determined by the number of basic network behavior types of the software.
First, each vector is regarded as an independent cluster, the similarity between every pair of feature vectors is calculated, and the most similar vectors are merged into the same cluster. Next, the center vector of each cluster is determined: for each vector in the cluster, the variance of its similarities to the other vectors in the cluster is calculated, and the vector with the smallest variance is taken as the center vector; if the variances are equal, one vector is selected at random as the center vector of the cluster. Finally, the similarity between the center vectors of the clusters is calculated and clusters with higher similarity are merged. This process is repeated until the result no longer changes from the previous round or the number of clusters is k. After classification is completed, each cluster has a center vector; the similarity between each vector in the cluster and the center vector is calculated, and the lowest similarity is selected as the baseline value. The resulting model is the baseline model.
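The clustering procedure can be sketched in Python as follows. This is a simplified illustration under several assumptions: the unweighted Pearson correlation stands in for the similarity measure, one pair of clusters is merged per round, ties in the center selection are broken by taking the first candidate rather than a random one, and the hypothetical parameter `min_sim` decides when no further merge happens. None of these choices is specified by the description.

```python
import numpy as np


def pearson(a, b):
    # Unweighted Pearson correlation, used here as a stand-in similarity.
    return float(np.corrcoef(a, b)[0, 1])


def center_index(vectors, members):
    """Index of the member whose similarities to the other members vary least."""
    if len(members) <= 2:
        return members[0]  # variances are trivially equal; take the first member
    variances = []
    for i in members:
        sims = [pearson(vectors[i], vectors[j]) for j in members if j != i]
        variances.append(np.var(sims))
    return members[int(np.argmin(variances))]


def cluster_baseline(vectors, k, min_sim=0.0):
    """Merge clusters by center similarity until k clusters remain or nothing merges."""
    clusters = [[i] for i in range(len(vectors))]  # every vector starts alone
    while len(clusters) > k:
        centers = [center_index(vectors, c) for c in clusters]
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = pearson(vectors[centers[a]], vectors[centers[b]])
                if s > best:
                    best, pair = s, (a, b)
        if pair is None or best < min_sim:
            break  # no change compared with the previous round
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


vectors = np.array([
    [1, 2, 3, 4], [2, 4, 6, 9], [1, 3, 5, 6],  # rising patterns
    [4, 3, 2, 1], [9, 6, 4, 2],                # falling patterns
], dtype=float)
clusters = cluster_baseline(vectors, k=2)
```

On this toy input the rising and falling patterns end up in separate clusters, because vectors within each pattern are positively correlated and vectors across patterns are negatively correlated.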
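The formula for calculating the similarity of the feature vectors is as follows:
$$\mathrm{sim}(\beta_i,\beta_j) = w_r \cdot \frac{\mathrm{cov}(\beta_i,\beta_j)}{\sigma(\beta_i)\,\sigma(\beta_j)}$$
where $w_r$ is a learnable weight matrix used to adjust the similarity between vectors, $\mathrm{cov}(\beta_i,\beta_j) = E[\beta_i\beta_j] - E[\beta_i]E[\beta_j]$ is the covariance between $\beta_i$ and $\beta_j$, and $\sigma(\beta_i)$ and $\sigma(\beta_j)$ are the standard deviations of the vectors $\beta_i$ and $\beta_j$.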
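A minimal sketch of this similarity is given below. It assumes that the weight acts as a multiplicative factor on the correlation coefficient and, for simplicity, treats `w_r` as a scalar with default value 1.0, although the description calls it a weight matrix.

```python
import numpy as np


def similarity(beta_i, beta_j, w_r=1.0):
    """Weighted correlation: w_r * cov(bi, bj) / (sigma(bi) * sigma(bj))."""
    bi = np.asarray(beta_i, dtype=float)
    bj = np.asarray(beta_j, dtype=float)
    # cov = E[bi * bj] - E[bi] * E[bj], matching the definition in the text.
    cov = np.mean(bi * bj) - bi.mean() * bj.mean()
    return w_r * cov / (bi.std() * bj.std())


same_trend = similarity([1, 2, 3], [2, 4, 6])
opposite_trend = similarity([1, 2, 3], [3, 2, 1])
down_weighted = similarity([1, 2, 3], [2, 4, 6], w_r=0.5)
```

With `w_r = 1.0` the value is the Pearson correlation coefficient, so it lies between -1 and 1; adjusting `w_r` rescales how strongly a given correlation counts as similar.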
During testing, the similarity between each feature vector and the center vector of each cluster in the baseline model is calculated; if the feature vector cannot be assigned to any cluster, the behavior chain is regarded as abnormal communication behavior of the software.
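The detection rule can be sketched as follows, assuming that a feature vector "cannot be assigned to any cluster" when its similarity to every center vector falls below that cluster's baseline value, and using the unweighted Pearson correlation as the similarity. The sample clusters and test vectors are made up for illustration.

```python
import numpy as np


def pearson(a, b):
    # Unweighted Pearson correlation, used here as a stand-in similarity.
    return float(np.corrcoef(a, b)[0, 1])


def baseline_value(members, center):
    """Lowest similarity between any member of a cluster and its center vector."""
    return min(pearson(v, center) for v in members)


def is_anomalous(beta, centers, baselines):
    """True if beta is less similar to every center than that cluster's baseline."""
    return all(pearson(beta, c) < t for c, t in zip(centers, baselines))


rising = np.array([[1, 2, 3, 4], [2, 4, 6, 9], [1, 3, 5, 6]], dtype=float)
falling = np.array([[4, 3, 2, 1], [9, 6, 4, 2]], dtype=float)
centers = [rising[0], falling[0]]  # assumed center vectors of the two clusters
baselines = [baseline_value(rising, centers[0]), baseline_value(falling, centers[1])]

normal_chain = np.array([2.0, 4.0, 6.0, 8.5])  # follows the rising pattern
odd_chain = np.array([1.0, 5.0, 1.0, 5.0])     # matches neither pattern
flags = [is_anomalous(v, centers, baselines) for v in (normal_chain, odd_chain)]
```

A vector that stays within the baseline of at least one cluster is treated as normal; only a vector outside all baselines is flagged.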
Step 4: optimizing the baseline model through manual review.
Abnormal communication behaviors detected by the model need to be reviewed manually. If a misjudgment is found, the similarity of the misjudged behavior is recalculated, and the learnable weight matrix wr used to calculate the similarity in step 3 is adjusted to optimize the baseline model.
To sum up, in order to obtain a series of network behaviors of the software, the invention groups the collected network data packets according to the characteristics of the data packets, and distinguishes the data packets sent by each software and the different behaviors represented by the data packets. The feature types used include, but are not limited to, statistical features, protocol-based features, time sequence features, manually labeled features, deep learning model extracted features, and the like, and effective features are extracted by combining the features and feature extraction methods. Classifying the data packets according to the characteristics based on predefined rules, and finally normalizing the characteristics.
The invention obtains the context relation of network communication behaviors based on self-attention. Self-attention is an attention mechanism for processing sequence data that allows a model to assign different attention weights to different locations in a sequence as it is processed, more flexibly capturing dependencies in a sequence. Therefore, the invention uses the self-attention component to process the software communication behavior, firstly obtains the relation inside a single node in the behavior to obtain the segment vector of the single node, and then combines the segment vectors to obtain the context relation of the whole behavior chain.
The statistical features of the invention can be used to understand the behavior trend and the degree of dispersion of a software behavior chain. For the vectors output by self-attention, the dimension is counted and statistical features such as the mean, standard deviation, variance, skewness, median absolute deviation and kurtosis are calculated to obtain a feature vector representing the statistical features of the communication behavior.
The specific category number of the present invention is determined according to the category of the software behavior. The basic behaviors of different software are similar, so that the basic behaviors from different software can be classified into the same class through clustering, the special behaviors of the software can be separately classified into one class, and after classification, the similarity baseline of each class is recorded.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention, and the scope of the present invention is defined by the claims.