Disclosure of Invention
The invention aims to provide a malicious code family classification method which can quickly and accurately judge the family information of malicious codes.
Embodiments of the invention may be implemented as follows:
In a first aspect, the present invention provides a malicious code family classification method, the method comprising:
labeling malicious code with family information;
extracting static features and dynamic features from the labeled malicious code;
generating a dynamic relationship graph of the malicious code according to the dynamic features;
inputting the static features and the dynamic relationship graph into a graph neural network model to train the graph neural network model; and
acquiring the static features and the dynamic relationship graph of malicious code to be classified, and inputting them into the trained graph neural network model to determine the family information of that malicious code.
In an alternative embodiment, the static features include the number of occurrences of key APIs, the number of key special characters, instruction opcode frequencies, instruction opcode n-grams, and byte sequence n-grams, and the dynamic features include API call dependency graphs, system call dependency graphs, and control flow graphs.
In an alternative embodiment, the dynamic features include a system call dependency graph, and the step of generating a dynamic relationship graph of the malicious code according to the dynamic features comprises:
converting the system call dependency graph into a fixed-size directed weighted graph;
calculating distance values between the directed weighted graphs; and
determining that two directed weighted graphs whose distance value is smaller than a threshold are similar, and connecting the two directed weighted graphs to generate the dynamic relationship graph of the malicious code.
In an alternative embodiment, the step of converting the system call dependency graph into a fixed-size directed weighted graph comprises:
grouping the system calls with an open source tool, and constructing the system call dependency graph from the call relationships among the system calls; and
aggregating the system calls that belong to the same group into one node and defining a new edge between two nodes, the weight of the new edge being the number of original edges between the two groups of nodes, so that the system call dependency graph is converted into a fixed-size group call graph, the group call graph being a directed weighted graph.
In an alternative embodiment, the step of calculating distance values between the directed weighted graphs comprises:
calculating the Jaccard distance $D_j$ between the node sets of the two directed weighted graphs;
calculating the in-degree cosine distance $D_{in}$ and the out-degree cosine distance $D_{out}$ between identical node pairs of the two directed weighted graphs; and
calculating the distance value $D$ between the two directed weighted graphs from the Jaccard distance $D_j$, the in-degree cosine distance $D_{in}$, and the out-degree cosine distance $D_{out}$.
In an alternative embodiment, the distance value D is calculated by the formula:
where th is set empirically.
In an alternative embodiment, the step of acquiring the static features and the dynamic relationship graph of the malicious code to be classified, and inputting them into the trained graph neural network model to determine the family information of the malicious code, comprises:
inputting the static features and the dynamic relationship graph into the graph neural network model to obtain an embedded vector of the malicious code; and
inputting the embedded vector into a classifier to determine the family information of the malicious code.
In an alternative embodiment, the classifier includes MLP, SVM and naive Bayes.
In an alternative embodiment, the graph neural network model includes GraphSAGE, GCN, and GAT.
In an alternative embodiment, the step of extracting the static features and the dynamic features from the labeled malicious code comprises:
extracting the static features with open source tools, the open source tools including PEFrame and IDA.
The malicious code family classification method provided by the embodiments of the invention has the following beneficial effects:
1. The classification method first labels malicious code with family information, extracts static features and dynamic features from it, and inputs the extracted features into a graph neural network model to train the model, so that the model learns a standard for classifying malicious code. The static features and the dynamic relationship graph of the malicious code to be classified are then input into the trained graph neural network model, so that the family information of the malicious code can be determined quickly and accurately.
2. The classification method takes both static features and dynamic features into account: it generates a dynamic relationship graph from the dynamic features and then uses the dynamic relationship graph together with the static features, fusing the two types of features from the perspective of association. The method is simple and highly accurate.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
It is understood that malicious code, also known as malware, is code that performs unauthorized operations in a computer system. Most malicious code is written for commercial purposes or to pry into other people's data, for example to promote a product, to provide a paid network service, or to deliberately damage other people's computers. It generally has three characteristics: it has a malicious purpose, it is itself a program, and it takes effect by being executed. At present, many new malicious codes are variants of existing malicious codes, and a variant belongs to the same family as its source malicious code, so quickly identifying the family to which a malicious code belongs is very important for guaranteeing network information security.
Referring to FIG. 1, the present embodiment provides a malicious code family classification method (hereinafter referred to as the "classification method"), which includes the following steps:
S1: Label the malicious code with family information.
Specifically, a sufficient number of malicious code samples are first collected and then labeled with family information; the labeled samples serve as the raw data for training the model.
S2: Extract static features and dynamic features from the labeled malicious code.
The static features are extracted with open source tools, which include PEFrame and IDA.
The static features comprise the number of occurrences of key APIs, the number of key special characters, instruction opcode frequencies, instruction opcode n-grams, and byte sequence n-grams. The dynamic features may be function call relationship graphs of the malicious code, including API (Application Programming Interface) call dependency graphs, system call dependency graphs, and control flow graphs.
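As an illustration of one of the static features listed above, the following is a minimal sketch of byte sequence n-gram counting. The function names `byte_ngram_counts` and `ngram_vector`, the default `n = 2`, and the idea of projecting the counts onto a fixed vocabulary are assumptions made for this example; the embodiment does not specify how the n-grams are encoded.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous window of n bytes in `data`."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sample shorter than n bytes simply yields an empty Counter.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_vector(data: bytes, vocabulary: list, n: int = 2) -> list:
    """Project the counts onto a fixed vocabulary (hypothetical choice),
    so that every sample yields a vector of the same length."""
    counts = byte_ngram_counts(data, n)
    return [counts[gram] for gram in vocabulary]
```

For example, `byte_ngram_counts(b"\x90\x90\x90\xc3", 2)` counts the 2-gram `b"\x90\x90"` twice and `b"\x90\xc3"` once.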
Specifically, the dynamic features may be extracted as follows: the malicious code is run in the virtual operating system layer of a dynamic sandbox, the calls to all operating system APIs made during execution are emulated, and the dynamic features produced by the malicious code are thereby triggered and extracted. The dynamic sandbox includes a virtual machine layer and a virtual operating system layer, where the virtual machine layer virtualizes the physical hardware of the computer and the virtual operating system layer is used for running and analyzing samples.
The extracted features may also be dynamic behavior record files, in which case a text machine learning model may be trained on the features to generate a malicious code family classification model. The extracted features may also be images converted from the dynamic behavior record files, in which case an image machine learning model may be trained on the features to generate a malicious code family classification model.
S3: Generate a dynamic relationship graph of the malicious code according to the dynamic features.
The dynamic relationship graph is generated mainly from the system call dependency graph among the dynamic features.
First, the system call dependency graph among the dynamic features is obtained through dynamic analysis and converted into a fixed-size directed weighted graph. Specifically, the system calls are grouped with an open source tool, such as NtTrace, and the system call dependency graph is constructed from the call relationships among the system calls. The system calls that belong to the same group are then aggregated into one node, and a new edge is defined between two nodes whose weight is the number of original edges between the two groups of nodes. The system call dependency graph is thus converted into a fixed-size group call graph, which is a directed weighted graph.
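The aggregation described above can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the function names, the system call names, the group labels `"file"` and `"registry"`, and the decision to keep calls within one group as a self-loop are all assumptions made for the example.

```python
from collections import defaultdict


def build_group_call_graph(call_edges, group_of):
    """Collapse a system call dependency graph into a group call graph.

    call_edges: iterable of (caller, callee) system call pairs.
    group_of:   dict mapping each system call name to its group label.
    Returns {(src_group, dst_group): weight}, where the weight is the
    number of original edges between the two groups.
    """
    weights = defaultdict(int)
    for src, dst in call_edges:
        # Edges inside one group become a self-loop (assumption).
        weights[(group_of[src], group_of[dst])] += 1
    return dict(weights)


def to_adjacency_matrix(weights, groups):
    """Lay the weights out on a fixed, ordered list of groups so that
    every sample yields a matrix of the same size."""
    index = {g: i for i, g in enumerate(groups)}
    matrix = [[0] * len(groups) for _ in groups]
    for (src, dst), w in weights.items():
        matrix[index[src]][index[dst]] = w
    return matrix


# Hypothetical grouping and call edges, for illustration only.
GROUP_OF = {
    "NtCreateFile": "file",
    "NtWriteFile": "file",
    "NtOpenKey": "registry",
    "NtSetValueKey": "registry",
}
EDGES = [
    ("NtCreateFile", "NtWriteFile"),
    ("NtWriteFile", "NtOpenKey"),
    ("NtOpenKey", "NtSetValueKey"),
    ("NtCreateFile", "NtOpenKey"),
]
GROUP_GRAPH = build_group_call_graph(EDGES, GROUP_OF)
```

Because the list of groups is fixed in advance, the resulting matrix has the same dimensions for every sample, which is what makes the group call graphs directly comparable.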
Then, the distance values between the directed weighted graphs are calculated. In this embodiment, the calculation combines both the node sets and the structure of the directed weighted graphs. Specifically, for the node sets, the Jaccard distance $D_j$ between the node sets of the two directed weighted graphs is calculated first; for the structure, the in-degree cosine distance $D_{in}$ and the out-degree cosine distance $D_{out}$ between identical node pairs of the two directed weighted graphs are calculated; the distance value $D$ between the two directed weighted graphs is then calculated from the Jaccard distance $D_j$, the in-degree cosine distance $D_{in}$, and the out-degree cosine distance $D_{out}$.
The calculation formula of the distance value D is as follows:
where th is set empirically.
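The three component distances named above can be sketched as follows. The sketch stops at the components and does not combine them into D. Several points are assumptions of this example rather than statements of the embodiment: graphs are represented as `{(src, dst): weight}` dictionaries, the cosine distances are averaged over the nodes common to both graphs, and two all-zero weight vectors are treated as identical.

```python
import math


def nodes_of(graph):
    """Set of nodes that appear as an endpoint of any edge."""
    return {node for edge in graph for node in edge}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two node sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cosine_distance(u, v):
    """1 - cos(u, v); zero vectors are handled by convention (assumption)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0 if not any(u) and not any(v) else 1.0
    return 1.0 - dot / norm


def degree_distances(g1, g2):
    """Return (d_in, d_out), averaged over nodes present in both graphs."""
    common = sorted(nodes_of(g1) & nodes_of(g2))
    if not common:
        return 1.0, 1.0
    universe = sorted(nodes_of(g1) | nodes_of(g2))
    d_in = d_out = 0.0
    for node in common:
        # Outgoing and incoming edge weights of the same node in each graph.
        out1 = [g1.get((node, other), 0) for other in universe]
        out2 = [g2.get((node, other), 0) for other in universe]
        in1 = [g1.get((other, node), 0) for other in universe]
        in2 = [g2.get((other, node), 0) for other in universe]
        d_out += cosine_distance(out1, out2)
        d_in += cosine_distance(in1, in2)
    return d_in / len(common), d_out / len(common)
```

With these conventions, two identical group call graphs have all three distances equal to zero, while graphs that share a node but route its calls to different groups have an out-degree cosine distance of 1 for that node.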
Finally, a threshold λ is set; two directed weighted graphs whose distance value D is smaller than the threshold λ are determined to be similar and are connected, thereby generating the dynamic relationship graph of the malicious code.
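The thresholding step can be sketched as follows, assuming the pairwise distance values D have already been computed and stored in a symmetric matrix. The function name `build_relation_graph` and the adjacency-list output format are choices made for this example.

```python
def build_relation_graph(distances, threshold):
    """Connect samples whose pairwise distance is below the threshold.

    distances: symmetric n x n matrix of distance values D.
    threshold: the threshold (lambda in the text).
    Returns an adjacency list {sample_index: [neighbor indices]}.
    """
    n = len(distances)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] < threshold:
                # The relationship is symmetric, so add both directions.
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors
```

For instance, with distances `[[0, 0.2, 0.9], [0.2, 0, 0.7], [0.9, 0.7, 0]]` and a threshold of 0.5, samples 0 and 1 are connected and sample 2 remains isolated.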
The variable parameters α, β, γ and λ may be preset manually or obtained by training on a small batch of samples. The training process is as follows: first, a small number of samples of each family are selected, the system call dependency graph of each sample is extracted and converted into a fixed-size group call graph, and the distance values D between the group call graphs are calculated; the parameters α, β and γ are then trained so that the distance value D between group call graphs belonging to the same family is smaller than the distance value D between group call graphs belonging to different families; finally, an appropriate threshold λ is chosen so that the families are clearly separated.
S4: Input the static features and the dynamic relationship graph into the graph neural network model to train the graph neural network model.
The graph neural network model includes GraphSAGE, GCN, and GAT. In this embodiment, the graph neural network model is GraphSAGE, which offers good flexibility and scalability.
First, the static features and the dynamic relationship graph are input into the graph neural network model to obtain the embedded vector of the malicious code.
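To illustrate how an embedded vector can be obtained from the static features and the dynamic relationship graph, the following is a minimal sketch of a single GraphSAGE-style layer with a mean aggregator, written in plain Python. It is not the embodiment's model: the weight matrices `w_self` and `w_neigh` are fixed inputs here, whereas in practice they would be learned during training, and the function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_vector(vectors, dim):
    """Element-wise mean; an isolated node gets a zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sage_mean_layer(features, neighbors, w_self, w_neigh):
    """One layer: h_v = ReLU(W_self x_v + W_neigh mean(x_u for u in N(v))).

    features:  {node: static feature vector}
    neighbors: {node: [neighbor nodes]} from the dynamic relationship graph
    """
    dim = len(next(iter(features.values())))
    embeddings = {}
    for node, x in features.items():
        agg = mean_vector([features[u] for u in neighbors.get(node, [])], dim)
        z = [a + b for a, b in zip(matvec(w_self, x), matvec(w_neigh, agg))]
        embeddings[node] = [max(0.0, v) for v in z]  # ReLU activation
    return embeddings
```

Each sample's embedding therefore mixes its own static features with the averaged features of the samples it is connected to in the dynamic relationship graph, which is how the two types of features are fused.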
The embedded vector is then input into a classifier to determine the family information of the malicious code. The classifier includes MLP, SVM and naive Bayes.
The classifier may be trained in advance: the embedded vectors of some labeled malicious codes are obtained first and input into the classifier, and the classifier is trained until its family classification is accurate.
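As an illustration of training a classifier on labeled embedded vectors, the following is a minimal sketch of a Gaussian naive Bayes classifier in plain Python. The Gaussian variant, the variance floor `var_floor`, and the function names are assumptions of this example; the embodiment only names naive Bayes as one possible classifier and does not specify its form.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(vectors, labels, var_floor=1e-6):
    """Estimate a log prior and per-dimension mean and variance per family."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    total = len(vectors)
    model = {}
    for label, rows in grouped.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # var_floor keeps the variance strictly positive (assumption).
        variances = [
            sum((x - m) ** 2 for x in col) / len(rows) + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / total), means, variances)
    return model


def predict_gaussian_nb(model, vec):
    """Return the family label with the highest log posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(vec, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))
```

A typical use would be to call `fit_gaussian_nb` on the embedded vectors of the labeled samples and then `predict_gaussian_nb` on the embedded vector of a sample to be classified.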
After training, the graph neural network model and the classifier are better suited to classifying malicious code, which improves the efficiency and accuracy of the malicious code classification method.
S5: Acquire the static features and the dynamic relationship graph of the malicious code to be classified, and input them into the trained graph neural network model to determine the family information of the malicious code.
First, the static features and dynamic features of the malicious code to be classified are extracted; the dynamic relationship graph of the malicious code is then generated according to the dynamic features; finally, the static features and the dynamic relationship graph are input into the graph neural network model, so that the family information of the malicious code is determined and the classification is completed.
The core of the malicious code family classification method provided by this embodiment is as follows: when the family information of a malicious code needs to be determined, its static features and the system call dependency graph among its dynamic features are first extracted; the system call dependency graph is then converted into a fixed-size group call graph, the distances between this group call graph and the existing group call graphs are calculated, and the group call graph is connected to those existing group call graphs that satisfy the threshold condition; the features of its neighbors are then aggregated to obtain the embedded vector of the malicious code, and the embedded vector is input into the classifier to obtain the family classification.
The malicious code family classification method provided by this embodiment has the following beneficial effects:
1. The classification method first labels malicious code with family information, extracts static features and dynamic features from it, and inputs the extracted features into a graph neural network model to train the model, so that the model learns a standard for classifying malicious code. The static features and the dynamic relationship graph of the malicious code to be classified are then input into the trained graph neural network model, so that the family information of the malicious code can be determined quickly and accurately.
2. The classification method takes both static features and dynamic features into account: it generates a dynamic relationship graph from the dynamic features and then uses the dynamic relationship graph together with the static features, fusing the two types of features from the perspective of association. The method is simple and highly accurate.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.