CN116886379B


Info

Publication number: CN116886379B
Application number: CN202310907847.7A
Authority: CN
Inventors: 张曼; 韩伟红; 贾焰; 鲁辉; 胡宁; 贾世准; 孙丽群; 陶莎; 马兰; 李小霞
Original assignee: Peng Cheng Laboratory
Current assignee: Peng Cheng Laboratory
Priority date: 2023-07-21
Filing date: 2023-07-21
Publication date: 2024-05-14
Anticipated expiration: 2043-07-21
Also published as: CN116886379A

Abstract

The application provides a network attack reconstruction method, a training method of a model and a related device, which belong to the technical field of network security.

Description

Translated from Chinese

Technical Field

The present application relates to the field of network security technology, and in particular to a network attack reconstruction method, a model training method, and related devices.

Background Art

With the rapid development of computers and the Internet, network security has become a major concern. Networks today face advanced persistent threat (APT) attacks: complex, organized attacks that aim to lurk for a long time and to continuously attack and penetrate specific targets. Reconstructing network attacks helps to understand attack methods in depth, improve defenses, predict future threats, and strengthen emergency response.

In the related art, network attack behavior is often reconstructed from various system logs. APT attacks, however, are customized and covert, so the attack has to be found within a very large volume of logs, which increases the workload of network attack reconstruction. The complexity of the data and the stealth of the attack make it harder to link isolated attack steps together and reconstruct the intrusion path. This increases the difficulty of network attack reconstruction and reduces its efficiency and quality.

Summary of the Invention

The main purpose of the embodiments of the present application is to provide a network attack reconstruction method, a model training method, and related devices that can improve the efficiency and quality of network attack reconstruction.

To achieve the above purpose, a first aspect of the embodiments of the present application provides a network attack reconstruction method, the method comprising: obtaining an execution audit log of ATT&CK tactics and techniques; parsing the execution audit log, extracting multiple entities and corresponding events from it, and constructing a causal graph with the entities as nodes and the causal relationships between entities represented by the events as edges; extracting the set of entities, dividing the set into multiple entity subsets, and converting each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains; performing feature conversion on each entity sequence to obtain a corresponding sequence feature vector, and inputting each sequence feature vector into a pre-trained attack recognition model to obtain a prediction result for each entity sequence; and, according to the prediction results, taking the entity sequences marked as ATT&CK behavior as target sequences, taking the entities in the target sequences as ATT&CK entities, determining in the causal graph the target paths between the ATT&CK nodes where the ATT&CK entities are located, and converting the causal graph into a reconstructed attack provenance graph according to the target paths and the ATT&CK nodes.
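The causal graph construction step can be sketched as follows. This is a minimal illustration that assumes the audit log has already been parsed into (source entity, event type, destination entity) triples; the triple layout, the sample events, and the function name `build_causal_graph` are hypothetical and are not specified by the application.

```python
from collections import defaultdict

# Hypothetical parsed audit events: (source entity, event type, destination entity).
# The log schema is an assumption made for illustration only.
EVENTS = [
    ("firefox", "write", "/tmp/payload"),
    ("/tmp/payload", "exec", "payload_proc"),
    ("payload_proc", "connect", "10.0.0.5:443"),
]

def build_causal_graph(events):
    """Return an adjacency map: entity -> list of (event type, successor entity)."""
    graph = defaultdict(list)
    for src, event, dst in events:
        graph[src].append((event, dst))  # the event is the causal edge src -> dst
        graph.setdefault(dst, [])        # make sure sink entities appear as nodes too
    return dict(graph)

causal_graph = build_causal_graph(EVENTS)
```

Entities become dictionary keys (nodes) and each event becomes a labeled directed edge, which is enough structure for the later subset extraction and path search steps.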

In some embodiments, extracting the set of entities and dividing it into multiple entity subsets includes: extracting the set of entities from the causal graph, where the set contains all entities in the causal graph; and selecting any two or more entities from the set to form an entity subset, thereby obtaining multiple entity subsets.
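One way to read "any two or more entities" is exhaustive enumeration of combinations, sketched below. The `max_size` cap is an assumption added here to keep the number of subsets manageable; the application does not state such a limit.

```python
from itertools import combinations

def entity_subsets(entities, max_size=None):
    """Yield every subset that contains at least two entities, smallest first."""
    entities = sorted(entities)  # fixed order so the output is reproducible
    upper = len(entities) if max_size is None else min(max_size, len(entities))
    for size in range(2, upper + 1):
        for subset in combinations(entities, size):
            yield subset

subsets = list(entity_subsets({"a", "b", "c"}))
```

For three entities this yields the three pairs followed by the single triple. The count grows exponentially with the number of entities, which is why a size cap or a sampling strategy would likely be needed on real graphs.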

In some embodiments, the attack recognition model has an encoder and a decoder, and inputting each sequence feature vector into the pre-trained attack recognition model to obtain a prediction result for each entity sequence includes: obtaining the position of each sequence feature vector within its sequence and the model dimension of the attack recognition model, establishing a dimension index for each sequence feature vector, and computing the position encoding of each sequence feature vector from the position, the model dimension, and the dimension index; combining each sequence feature vector with its position encoding and inputting the result into the encoder, where the multi-head attention layer and the feedforward layer produce an encoded feature vector for each sequence feature vector; and combining each encoded feature vector with its position encoding and inputting the result into the decoder, where the masked multi-head attention layer, the multi-head attention layer, the residual connection and layer normalization layer, the feedforward layer, the linear layer, and the classification output layer produce the prediction result for each entity sequence.
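The position encoding is described only in terms of its inputs (position, model dimension, dimension index). The sketch below assumes the standard sinusoidal encoding of the original Transformer, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); that choice is an assumption, since the formula is not given in this part of the text.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            i = k // 2  # dimension index shared by each sin/cos pair
            angle = pos / (10000 ** (2 * i / d_model))
            # even columns use sine, odd columns use cosine
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=8)
```

Each row would be added element-wise to the feature vector at the same position before the sequence enters the encoder.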

In some embodiments, the attack recognition model is trained by the following steps: obtaining a sample audit log of ATT&CK tactics and techniques; parsing the sample audit log, extracting multiple sample entities and corresponding sample events from it, and constructing a sample causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges; extracting the set of sample entities from the sample causal graph, determining multiple sample ATT&CK entity subsets from that set, and converting each sample ATT&CK entity subset into a corresponding sample entity sequence according to the causal relationships between the entities it contains, where each sample ATT&CK entity subset includes at least one sample ATT&CK entity; performing feature conversion on each sample entity sequence to obtain a corresponding sample sequence feature vector, and inputting the sample sequence feature vectors into the attack recognition model in turn to obtain a sample prediction result for each sample entity sequence; and labeling each sample entity sequence with a sample label according to the sample ATT&CK entities it contains, and adjusting the parameters of the attack recognition model according to the sample labels and the sample prediction results to obtain the trained attack recognition model.

In some embodiments, determining multiple sample ATT&CK entity subsets from the set of sample entities includes: determining, from the set of sample entities, at least one sample ATT&CK entity marked as ATT&CK behavior, and constructing at least one attack subset from these entities, where every entity in an attack subset is a sample ATT&CK entity and each attack subset contains at least one sample ATT&CK entity; determining, from the set of sample entities, at least one sample normal entity marked as normal behavior, and adding at least one sample normal entity to each attack subset to form at least one normal subset; and using the attack subsets and the normal subsets together as the sample ATT&CK entity subsets.
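The construction of attack subsets and normal subsets can be sketched as below. The sketch enumerates every non-empty combination of ATT&CK entities and adds exactly one normal entity per normal subset, which is the simplest case of "at least one"; both choices, and the entity names, are illustrative assumptions.

```python
from itertools import combinations

def build_sample_subsets(attack_entities, normal_entities):
    """Return (attack_subsets, normal_subsets) as lists of frozensets."""
    attack_entities = sorted(attack_entities)
    # Attack subsets: only ATT&CK entities, at least one per subset.
    attack_subsets = [
        frozenset(combo)
        for size in range(1, len(attack_entities) + 1)
        for combo in combinations(attack_entities, size)
    ]
    # Normal subsets: an attack subset plus one normal entity.
    normal_subsets = [
        subset | {normal}
        for subset in attack_subsets
        for normal in sorted(normal_entities)
    ]
    return attack_subsets, normal_subsets

attack, normal = build_sample_subsets({"mal.exe", "c2.host"}, {"explorer.exe"})
```

Every normal subset still contains at least one ATT&CK entity, which matches the requirement that each sample ATT&CK entity subset includes at least one sample ATT&CK entity.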

In some embodiments, inputting the sample sequence feature vectors into the attack recognition model in turn includes: if the ratio of the number of normal subsets to the number of attack subsets is greater than a preset threshold, either oversampling the sample sequence feature vectors to increase the number of vectors corresponding to attack subsets, or undersampling the sample sequence feature vectors to reduce the number of vectors corresponding to normal subsets; and inputting the oversampled or undersampled sample sequence feature vectors into the attack recognition model in turn.
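The resampling rule can be sketched with random duplication and random removal. The threshold value, the target of equal class sizes, and the function name `rebalance` are assumptions for illustration; the application only states that resampling is triggered when the ratio exceeds a preset threshold.

```python
import random

def rebalance(attack_samples, normal_samples, threshold=2.0, mode="over", seed=0):
    """Resample only when len(normal) / len(attack) exceeds the threshold."""
    rng = random.Random(seed)
    attack, normal = list(attack_samples), list(normal_samples)
    if not attack or len(normal) / len(attack) <= threshold:
        return attack, normal  # ratio acceptable, leave the data unchanged
    if mode == "over":
        # Oversampling: duplicate random attack samples up to the normal count.
        extra = [rng.choice(attack) for _ in range(len(normal) - len(attack))]
        attack = attack + extra
    else:
        # Undersampling: keep a random subset of normal samples.
        normal = rng.sample(normal, len(attack))
    return attack, normal

over_attack, over_normal = rebalance(["a1"], ["n1", "n2", "n3", "n4"], mode="over")
under_attack, under_normal = rebalance(["a1"], ["n1", "n2", "n3", "n4"], mode="under")
```

Oversampling keeps all normal data at the cost of repeated attack samples, while undersampling discards normal data; which is preferable depends on how much data is available.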

In some embodiments, constructing the sample causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges includes: constructing an initial causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges; and cleaning and organizing the nodes and edges of the initial causal graph to obtain the sample causal graph.

In some embodiments, cleaning and organizing the nodes and edges of the initial causal graph to obtain the sample causal graph includes: determining the sample ATT&CK nodes marked as ATT&CK behavior and the sample normal nodes marked as normal behavior in the initial causal graph, and deleting the sample normal nodes that cannot reach any sample ATT&CK node, together with their edges; determining duplicate edges between any two nodes in the initial causal graph and deleting the duplicates; and determining different nodes in the initial causal graph that correspond to the same type of event, merging them into a single node, and retaining the same incoming and outgoing edges.
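Two of the cleaning operations, duplicate-edge removal and pruning of normal nodes that cannot reach an ATT&CK node, can be sketched as follows. The sketch assumes that "reach" means a directed path from the normal node to an ATT&CK node; that reading, and the node names, are assumptions. Node merging is omitted because its matching rule is not detailed here.

```python
def clean_graph(edges, attack_nodes):
    """Deduplicate edges and drop nodes with no directed path to an attack node."""
    unique_edges = list(dict.fromkeys(edges))  # order-preserving de-duplication
    # Reverse adjacency lets us walk backwards from the attack nodes.
    predecessors = {}
    for src, dst in unique_edges:
        predecessors.setdefault(dst, set()).add(src)
    keep, stack = set(attack_nodes), list(attack_nodes)
    while stack:
        node = stack.pop()
        for pred in predecessors.get(node, ()):
            if pred not in keep:
                keep.add(pred)
                stack.append(pred)
    # Keep only edges whose endpoints both survive the pruning.
    return [(s, d) for s, d in unique_edges if s in keep and d in keep]

edges = [("bash", "mal"), ("bash", "mal"), ("cron", "logrotate"), ("mal", "c2")]
cleaned = clean_graph(edges, attack_nodes={"mal", "c2"})
```

Here the repeated `bash -> mal` edge collapses to one edge, and the unrelated `cron -> logrotate` pair is removed because neither node leads to an ATT&CK node.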

In some embodiments, performing feature conversion on each entity sequence to obtain the corresponding sequence feature vector includes: obtaining a preset name mapping table; mapping the entity names in each entity sequence onto the name mapping table, re-determining the entity names in each entity sequence according to the mapping result, and obtaining the lemmatized entity sequence from the re-determined entity names; and performing feature conversion on each lemmatized entity sequence to obtain the corresponding sequence feature vector.
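The name mapping step can be sketched as a lookup that replaces concrete entity names with canonical ones. The table below, its use of regular expressions, and the canonical names are all hypothetical; the application only states that a preset name mapping table is used.

```python
import re

# Hypothetical mapping table: regex pattern -> canonical entity name.
NAME_MAP = [
    (re.compile(r"^/tmp/.+"), "tmp_file"),
    (re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$"), "ip_address"),
    (re.compile(r"^firefox(-bin)?$"), "browser"),
]

def normalize(sequence):
    """Replace each entity name by its canonical form; unknown names pass through."""
    out = []
    for name in sequence:
        for pattern, canonical in NAME_MAP:
            if pattern.match(name):
                out.append(canonical)
                break
        else:
            out.append(name)  # no table entry matched
    return out

normalized = normalize(["firefox-bin", "/tmp/a.sh", "10.0.0.5:443", "sshd"])
```

Collapsing variable details such as file paths and addresses into shared names shrinks the vocabulary the model has to learn, which is the usual motivation for this kind of normalization.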

In some embodiments, determining in the causal graph the target paths between the ATT&CK nodes where the ATT&CK entities are located, and converting the causal graph into the reconstructed attack provenance graph according to the target paths and the ATT&CK nodes, includes: assigning a weight to each edge according to the causal relationship or path length between nodes in the causal graph; selecting the ATT&CK node of any one ATT&CK entity as the initial node and the ATT&CK node of any other ATT&CK entity as the target node, computing the shortest path from the initial node to the target node with Dijkstra's algorithm based on the weights, and taking the shortest path as the target path; and retaining in the causal graph the ATT&CK nodes and the nodes and edges on the target paths, deleting the nodes and edges outside the target paths, and taking the processed causal graph as the reconstructed attack provenance graph.
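The path search and pruning described above can be sketched with a heap-based Dijkstra implementation. The graph representation (node to list of `(neighbor, weight)` pairs), the example weights, and the function names are assumptions for illustration.

```python
import heapq

def dijkstra_path(graph, start, goal):
    """graph: node -> list of (neighbor, weight). Return the shortest path or None."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None  # goal unreachable from start
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

def reconstruct(graph, attack_nodes):
    """Keep only edges lying on shortest paths between pairs of attack nodes."""
    keep_edges = set()
    nodes = sorted(attack_nodes)
    for a in nodes:
        for b in nodes:
            if a != b:
                path = dijkstra_path(graph, a, b)
                if path:
                    keep_edges.update(zip(path, path[1:]))
    return keep_edges

graph = {
    "mal": [("tmp", 1), ("noise", 5)],
    "tmp": [("c2", 1)],
    "noise": [("c2", 5)],
    "c2": [],
}
provenance_edges = reconstruct(graph, {"mal", "c2"})
```

In this example the cheaper route `mal -> tmp -> c2` is kept and the detour through `noise` is discarded, so the pruned graph contains only the ATT&CK nodes and the nodes on the target path.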

To achieve the above purpose, a second aspect of the embodiments of the present application provides a training method for an attack recognition model, the method comprising: obtaining a sample audit log of ATT&CK tactics and techniques; parsing the sample audit log, extracting multiple sample entities and corresponding sample events from it, and constructing a sample causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges; extracting the set of sample entities from the sample causal graph, determining multiple sample ATT&CK entity subsets from that set, and converting each sample ATT&CK entity subset into a corresponding sample entity sequence according to the causal relationships between the entities it contains, where each sample ATT&CK entity subset includes at least one sample ATT&CK entity; performing feature conversion on each sample entity sequence to obtain a corresponding sample sequence feature vector, and inputting the sample sequence feature vectors into the attack recognition model in turn to obtain a sample prediction result for each sample entity sequence; and labeling each sample entity sequence with a sample label according to the sample ATT&CK entities it contains, and adjusting the parameters of the attack recognition model according to the sample labels and the sample prediction results to obtain the trained attack recognition model.

To achieve the above purpose, a third aspect of the embodiments of the present application provides a network attack reconstruction device, the device comprising: a log acquisition module, configured to obtain an execution audit log of ATT&CK tactics and techniques; a causal graph construction module, configured to parse the execution audit log, extract multiple entities and corresponding events from it, and construct a causal graph with the entities as nodes and the causal relationships between entities represented by the events as edges; a sequence extraction module, configured to extract the set of entities, divide the set into multiple entity subsets, and convert each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains; a model prediction module, configured to perform feature conversion on each entity sequence to obtain a corresponding sequence feature vector, and input each sequence feature vector into a pre-trained attack recognition model to obtain a prediction result for each entity sequence; and an attack reconstruction module, configured to take, according to the prediction results, the entity sequences marked as ATT&CK behavior as target sequences, take the entities in the target sequences as ATT&CK entities, determine in the causal graph the target paths between the ATT&CK nodes where the ATT&CK entities are located, and convert the causal graph into a reconstructed attack provenance graph according to the target paths and the ATT&CK nodes.

To achieve the above purpose, a fourth aspect of the embodiments of the present application provides an electronic device comprising a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the network attack reconstruction method of the first aspect or the attack recognition model training method of the second aspect.

To achieve the above purpose, a fifth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the network attack reconstruction method of the first aspect or the attack recognition model training method of the second aspect.

The embodiments of the present application have the following beneficial effects:

The attack provenance graph is built from the execution audit log of ATT&CK tactics and techniques. First, a causal graph is constructed with the entities extracted from the execution audit log as nodes and the events as edges. This causal graph contains a large number of normal nodes, and it is not yet known which nodes are ATT&CK nodes, so it cannot serve as the final provenance graph and needs further processing. The embodiments therefore extract the set of entities, divide it into multiple entity subsets, and convert each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains, so that different nodes are packaged as sequences the model can process. Each entity sequence is then feature-converted and input into the pre-trained attack recognition model to obtain a prediction result for that sequence. In this way the model determines whether the entities in a sequence are ATT&CK entities. The nodes corresponding to ATT&CK entities are the ATT&CK nodes, so the target paths between ATT&CK nodes can be determined in the causal graph, and the causal graph is finally converted into the reconstructed attack provenance graph according to the target paths and the ATT&CK nodes. The embodiments need only the execution audit log of ATT&CK tactics and techniques to build the attack provenance graph, without additional workload, and by placing entities in entity sequences and identifying attack sequences with the attack recognition model they can accurately determine the ATT&CK entities and nodes. The embodiments can therefore improve the efficiency and quality of network attack reconstruction.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of an optional implementation environment provided by an embodiment of the present application;

FIG. 2 is an optional flowchart of the network attack reconstruction method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an initial causal graph provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of extracting entity sequences from a causal graph provided by an embodiment of the present application;

FIG. 5 is a schematic flowchart of step S103 in FIG. 2;

FIG. 6 is a schematic diagram of the structure of the attack recognition model provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of step S104 in FIG. 2;

FIG. 8 is a schematic flowchart of the training process of the attack recognition model provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of step S403 in FIG. 8;

FIG. 10 is a schematic diagram of extracted attack subsets and normal subsets provided by an embodiment of the present application;

FIG. 11 is a schematic flowchart of step S404 in FIG. 8;

FIG. 12 is a schematic diagram of the data processing procedure in the training phase provided by an embodiment of the present application;

FIG. 13 is a schematic flowchart of step S402 in FIG. 8;

FIG. 14 is a schematic flowchart of step S702 in FIG. 13;

FIG. 15 is a schematic diagram of a sample causal graph after cleaning and organizing provided by an embodiment of the present application;

FIG. 16 is another schematic flowchart of step S104 in FIG. 2;

FIG. 17 is a schematic flowchart of step S105 in FIG. 2;

FIG. 18 is a schematic diagram of the complete attack detection process provided by an embodiment of the present application;

FIG. 19 is an optional flowchart of the training method for the attack recognition model provided by an embodiment of the present application;

FIG. 20 is a schematic diagram of the functional modules of the network attack reconstruction device provided by an embodiment of the present application;

FIG. 21 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the present application and do not limit it.

It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a different module division or in a different order. The terms "first", "second", and so on in the specification, the claims, and the above drawings distinguish similar objects and do not necessarily describe a specific order or sequence.

Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by those skilled in the technical field of this application. The terms used herein only describe the embodiments of this application and are not intended to limit it.

First, several terms used in this application are explained:

Artificial intelligence (AI): a technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. AI is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that respond in a way similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. AI can simulate the information processes of human consciousness and thinking. AI also covers the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.

Transformer model: a deep learning model based on the self-attention mechanism and used to process sequence data. By introducing self-attention, the Transformer addresses the problems of vanishing gradients and poor parallelism. The model consists of an encoder and a decoder. The encoder converts the input sequence into a series of hidden representations, and the decoder uses these hidden representations to generate the output sequence. Each encoder and decoder contains multiple layers of attention and feedforward neural networks. With attention, the model can attend to different positions of the input sequence and so better capture the relationships between them. By computing attention weights between each position and every other position, the Transformer can capture the important information in a sequence and achieve better modeling and feature extraction.
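The attention-weight computation mentioned above can be illustrated with single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below uses plain Python lists and omits the learned projection matrices and multiple heads of a full Transformer; it is a simplified illustration, not the model of the application.

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Numerically stable softmax turns scores into attention weights.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0]]
out = attention(x, x, x)  # self-attention: queries, keys, values share one input
```

Because every position is compared with every other position in one pass, the computation for all positions can run in parallel, unlike the step-by-step processing of a recurrent network.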

Advanced persistent threat (APT) attack: a complex, organized, and targeted form of network attack. APT attacks are characterized by stealth, persistence, and a high degree of customization. They aim to lurk for a long time and obtain sensitive information from the target system, and they are usually carried out by highly specialized and organized hacker groups, intelligence agencies, or other malicious organizations. Attackers use advanced techniques, social engineering, and vulnerability exploitation to infiltrate the target network and evade traditional security measures.

ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge): a curated knowledge base and model of cyber adversary behavior that reflects the stages of the adversary attack lifecycle and the platforms adversaries are known to target. ATT&CK focuses on how external adversaries attack and operate within computer information networks. It originated from a project to record and classify adversary tactics, techniques, and procedures (TTPs) used in adaptive attacks on systems, with the goal of improving the detection of malicious behavior. The ATT&CK framework records attack tactics and techniques in a structured way, giving security practitioners a better way to understand and respond to threats. It is divided into several parts, including tactics, techniques, and software, each of which provides detailed attack patterns and corresponding defense recommendations.

In the related art, network attack behavior is often reconstructed from various system logs. APT attacks, however, are customized and covert, so the attack has to be found within a very large volume of logs. Searching for attack behavior among the hundreds of millions of log records generated every day is like looking for a needle in a haystack, which increases the workload of network attack reconstruction. The complexity of the data and the stealth of the attack make it harder to link isolated attack steps together and reconstruct the intrusion path, which increases the difficulty of reconstruction. As a result, security researchers often struggle to discover APT attacks, which can stay hidden in an information system for months, and the efficiency and quality of network attack reconstruction are reduced.

To solve this problem, the related art has proposed a provenance graph construction method based on COTS audit data. The method separates code from data by setting two kinds of trust labels, marks dangerous behavior through a label propagation policy, and finally determines the scope of impact of the APT attack through backward and forward search. However, detection based on label propagation has two parts, label initialization and label propagation. Choosing the initial labels requires the experience and knowledge of professionals, and label propagation is prone to the "dependency explosion" problem: without additional controls, or when a label is misjudged and then propagated, a single label can spread everywhere and cause a large number of false alarms. If the number of propagation steps and the propagation time are limited, attackers can still evade detection by staying hidden for a long time and by using a large number of intermediate nodes.

Other techniques map APT activity onto the kill chain and introduce tactics, techniques, and procedures (TTPs) and a high-level scenario graph (HSG) to bridge the large semantic gap between low-level audit data and the attacker's goals and intent. These techniques still face the problem of large data volume, and linking isolated attack steps together to reconstruct the intrusion path of the attack remains difficult.

Still other techniques implement attack detection with a Long Short-Term Memory (LSTM) network model. However, existing LSTM-based real-time detection methods may fail to distinguish malicious behavior from benign behavior accurately. In addition, LSTM still suffers from the vanishing gradient problem in complex neural networks, and because each word depends on the information from the preceding inputs, the computation is serial.

Based on this, the embodiments of the present application provide a network attack reconstruction method, a model training method and related devices. Only the execution audit log of ATT&CK techniques and tactics needs to be obtained to construct the attack tracing graph, with no additional workload. Entities are placed in entity sequences so that the attack recognition model can determine which sequences are attack sequences, which allows the ATT&CK entities and nodes to be determined accurately and the attack tracing graph to be constructed. The embodiments of the present application can therefore improve the efficiency and quality of network attack reconstruction.

相比现有技术，本申请实施例能够具备以下优点：Compared with the prior art, the embodiments of the present application can have the following advantages:

(1) Although behavior that appears in ATT&CK techniques and tactics does not necessarily indicate an attack, almost all attacks are reflected in ATT&CK techniques and tactics, and the techniques and tactics used by the same APT organization mostly do not change significantly over a period of time. Identifying ATT&CK behavior and combining it with other information helps to trace the APT organization. The embodiments of the present application can therefore mine traces of APT attacks from massive audit logs spanning several days to several months, link isolated and discrete attack behaviors together, and reconstruct the complete course of the APT attack. In addition, the reconstructed causal graph can be matched against ATT&CK tactic and technique sequences to find the attack behavior and extract the techniques and tactics used by the APT attack;

(2) If a label-based method does not restrict its propagation policy, it is prone to the "dependency explosion" problem, and determining the initial labels accurately is also difficult. By contrast, the attack recognition model in the embodiments of the present application is a Transformer model, which is trained by sequence learning to identify suspicious attack sequences.

（3）LSTM在复杂的神经网络里还是会出现梯度消失的问题，并且每个词都会用到前面输入的信息，属于串行运算。而Transformer模型属于并行计算，直接计算每个词的注意力机制（Attention）值。(3) LSTM still has the problem of vanishing gradients in complex neural networks, and each word uses the previously input information, which is a serial operation. The Transformer model, on the other hand, is a parallel calculation that directly calculates the attention value of each word.
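As a minimal sketch of the attention value mentioned in item (3), the following assumes the standard scaled dot-product form, softmax(QK^T / sqrt(d_k))V; the description does not state which attention variant the model uses, and the function names are illustrative. The output for each position is computed from the full set of keys and values without depending on the outputs of earlier positions, which is what permits parallel computation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each query row is handled independently of the other rows, so the
    loop below could run in parallel (assumed standard formulation).
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two identical tokens attend equally to both value vectors.
tokens = [[1.0, 0.0], [1.0, 0.0]]
vals = [[2.0, 0.0], [0.0, 2.0]]
result = attention(tokens, tokens, vals)
```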

本申请实施例提供的网络攻击重构方法、模型的训练方法及相关装置，具体通过如下实施例进行说明，首先描述本申请实施例中的网络攻击重构系统。The network attack reconstruction method, model training method and related devices provided in the embodiments of the present application are specifically illustrated through the following embodiments. First, the network attack reconstruction system in the embodiments of the present application is described.

FIG. 1 is a schematic diagram of an optional implementation environment provided in an embodiment of the present application. The implementation environment includes a terminal 11 and a server 12, which are connected via a communication network.

服务器12可以是独立的物理服务器，也可以是多个物理服务器构成的服务器集群或者分布式系统，还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN（Content Delivery Network，内容分发网络）、以及大数据和人工智能平台等基础云计算服务的云服务器。另外，服务器12还可以是区块链网络中的一个节点服务器。The server 12 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. In addition, the server 12 may also be a node server in the blockchain network.

终端11可以是虚拟现实设备、智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、车载终端等，但并不局限于此。终端11以及服务器12可以通过有线或无线通信方式进行直接或间接地连接，本申请实施例在此不做限制。The terminal 11 may be a virtual reality device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, etc., but is not limited thereto. The terminal 11 and the server 12 may be directly or indirectly connected via wired or wireless communication, which is not limited in the embodiments of the present application.

示例性地，服务器12可以获取ATT&CK技战术的执行审计日志，ATT&CK技战术的执行审计日志可以是从终端11获取的，也可以是从其他服务器或终端获取的；随后解析执行审计日志，从执行审计日志中提取多个实体和对应的事件，并以实体作为节点、以事件表征的实体之间的因果关系作为边构建因果图；提取实体的集合，根据实体的集合划分得到多个实体子集，并根据各个实体子集中所包含的实体之间的因果关系转换得到对应的各个实体序列；将各个实体序列进行特征转换，得到对应的各个序列特征向量，并将各个序列特征向量输入到预先训练好的攻击识别模型中，预测得到各个实体序列的预测结果；根据预测结果确定标记为ATT&CK行为的实体序列为目标序列，确定目标序列中的实体为ATT&CK实体，并在因果图中确定ATT&CK实体所在的ATT&CK节点之间的目标路径，根据目标路径和ATT&CK节点将因果图转换为重构后的攻击溯源图。Exemplarily, the server 12 can obtain the execution audit log of ATT&CK techniques and tactics, which can be obtained from the terminal 11 or from other servers or terminals; then parse the execution audit log, extract multiple entities and corresponding events from the execution audit log, and construct a causal graph with entities as nodes and causal relationships between entities represented by events as edges; extract a set of entities, divide the entity set into multiple entity subsets, and convert the causal relationships between entities contained in each entity subset to obtain corresponding entity sequences; perform feature conversion on each entity sequence to obtain corresponding sequence feature vectors, and input each sequence feature vector into a pre-trained attack recognition model to predict the prediction results of each entity sequence; determine the entity sequence marked as ATT&CK behavior as the target sequence according to the prediction results, determine the entity in the target sequence as the ATT&CK entity, and determine the target path between the ATT&CK nodes where the ATT&CK entity is located in the causal graph, and convert the causal graph into a reconstructed attack tracing graph according to the target path and the ATT&CK node.

Exemplarily, the terminal 11 can send the execution audit log of ATT&CK techniques and tactics to the server 12, so that the server 12 processes it; the terminal 11 can also receive the attack tracing graph sent by the server 12 and display it; and the terminal 11 can send a request to generate the attack tracing graph to the server 12, so that the server 12 executes the corresponding method after receiving the request.

基于此，本申请实施例中的网络攻击重构方法可以通过如下实施例进行说明。Based on this, the network attack reconstruction method in the embodiment of the present application can be illustrated by the following embodiment.

本申请实施例可以基于人工智能技术对相关的数据进行获取和处理。其中，人工智能（Artificial Intelligence，AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能，感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。The embodiments of the present application can acquire and process relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、机器人技术、生物识别技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。The basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, etc. Artificial intelligence software technologies mainly include computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

需要说明的是，在本申请的各个具体实施方式中，当涉及到需要根据用户信息、用户行为数据，用户历史数据以及用户位置信息等与用户身份或特性相关的数据进行相关处理时，都会先获得用户的许可或者同意，例如，获取ATT&CK技战术的执行审计日志时，会先获得用户的许可或者同意。而且，对这些数据的收集、使用和处理等，都会遵守相关法律法规和标准。此外，当本申请实施例需要获取用户的敏感个人信息时，会通过弹窗或者跳转到确认页面等方式获得用户的单独许可或者单独同意，在明确获得用户的单独许可或者单独同意之后，再获取用于使本申请实施例能够正常运行的必要的用户相关数据。It should be noted that in each specific implementation of the present application, when it comes to the need to perform relevant processing based on data related to user identity or characteristics such as user information, user behavior data, user historical data, and user location information, the user's permission or consent will be obtained first. For example, when obtaining the execution audit log of ATT&CK techniques and tactics, the user's permission or consent will be obtained first. Moreover, the collection, use, and processing of these data will comply with relevant laws, regulations, and standards. In addition, when the embodiment of the present application needs to obtain the user's sensitive personal information, the user's separate permission or consent will be obtained through a pop-up window or by jumping to a confirmation page. After clearly obtaining the user's separate permission or consent, the necessary user-related data used to enable the normal operation of the embodiment of the present application will be obtained.

FIG. 2 is an optional flowchart of the network attack reconstruction method provided in an embodiment of the present application. The method in FIG. 2 may include, but is not limited to, steps S101 to S105.

步骤S101，获取ATT&CK技战术的执行审计日志；Step S101, obtaining the execution audit log of ATT&CK techniques and tactics;

Exemplarily, the execution audit log of ATT&CK techniques and tactics is a log that records and audits the activities of executing ATT&CK techniques and tactics in a network environment. Attackers who carry out APT attacks or other malicious activities use a variety of techniques to infiltrate and manipulate the target network, and ATT&CK techniques and tactics define the range of techniques and methods these attackers may use.

Exemplarily, the purpose of the execution audit log is to collect and record events and activities related to ATT&CK techniques and tactics, producing traceable log records in the network. These records can be used to monitor and detect potential attack activity, analyze attack paths, support forensics and investigation, and provide valuable information for defending against and responding to future attacks.

Exemplarily, execution audit logs related to ATT&CK techniques and tactics can be collected and recorded in several ways. For example, the system audit function can be configured and enabled to record the related events and activities; SIEM tools can be used to integrate and centrally manage log records from different systems and devices; network traffic monitoring tools can be used to capture and record network traffic, providing real-time analysis and recording of packets and helping to detect and track the related attack activity; endpoint security solutions, such as endpoint detection and response (EDR) tools, endpoint protection platforms, or communication connections with other servers, can be used to collect and record the related endpoint activity; and cloud-based environments can be used to collect and record the related cloud resource activities and events. The embodiments of the present application do not specifically limit how the execution audit log of ATT&CK techniques and tactics is obtained.

步骤S102，解析执行审计日志，从执行审计日志中提取多个实体和对应的事件，并以实体作为节点、以事件表征的实体之间的因果关系作为边构建因果图；Step S102, parsing the execution audit log, extracting multiple entities and corresponding events from the execution audit log, and constructing a causal graph with the entities as nodes and the causal relationships between the entities represented by the events as edges;

Exemplarily, after the execution audit log is obtained, it needs to be parsed. Parsing yields the subjects and objects in the log, which are treated as entities; an entity may be, for example, a process, a file, or a network connection. When the causal graph is formed, each entity becomes a node; for example, each circle in FIG. 3 is a node, and different nodes represent different entities. In the execution audit log, an event represents a specific operation or activity that occurs in a system, network, or application; user logins, file accesses, and process launches can all be regarded as events. Analyzing these events yields information about attacker behavior, abnormal activity, or potential security threats. When the causal graph is formed, the causal relationships between entities represented by the events become the edges; for example, the lines connecting the nodes in FIG. 3 are edges, and different edges represent different events.

Exemplarily, the conversion of logs into a causal graph is carried out by a program. Subjects and objects are both entities (processes, files, network connections, etc.), that is, nodes, which appear as circles in the graph. Entities are connected by actions, and an event represents the action between the corresponding nodes. An event is a four-tuple (subject, action, object, timestamp), so the causal graph can represent the action performed between a subject and an object at a given timestamp. For example, suppose process A opens file B at 06:00:06. Process A and file B are the subject and the object respectively; they are the extracted entities and can serve as nodes in the causal graph. "Open" is the action, which represents the causal relationship between process A and file B, and the whole record is the event. Attacks usually involve multiple nodes and events, so the resulting causal graph contains multiple nodes and edges.
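The mapping from events to a causal graph can be sketched as follows. This is a minimal illustration, assuming events have already been parsed into (subject, action, object, timestamp) records; the `Event` type, the `build_causal_graph` function, and the second sample event are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One parsed audit record: (subject, action, object, timestamp)."""
    subject: str
    action: str
    obj: str
    timestamp: str


def build_causal_graph(events):
    """Return (nodes, edges) where entities are nodes and events are edges.

    edges maps a (subject, object) pair to the list of (action, timestamp)
    events between them, so two nodes may be joined by several edges.
    """
    nodes = set()
    edges = defaultdict(list)
    for ev in events:
        nodes.add(ev.subject)
        nodes.add(ev.obj)
        edges[(ev.subject, ev.obj)].append((ev.action, ev.timestamp))
    return nodes, dict(edges)


events = [
    Event("process A", "open", "file B", "06:00:06"),   # example from the text
    Event("process A", "write", "file B", "06:00:09"),  # hypothetical extra event
]
nodes, edges = build_causal_graph(events)
```

Keeping a list of events per node pair preserves both the action and the timestamp of every edge, which later steps need in order to sort events by time.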

步骤S103，提取实体的集合，根据实体的集合划分得到多个实体子集，并根据所包含的实体之间的因果关系将各个实体子集转换为对应的各个实体序列；Step S103, extracting a set of entities, dividing the set of entities into multiple entity subsets, and converting each entity subset into a corresponding entity sequence according to the causal relationship between the included entities;

示例性的，在输入到模型进行处理之前，需要明确模型的输入数据，本申请实施例中模型的输入数据是根据序列转化得到的特征向量，而构建序列首先需要提取实体的集合。提取实体集合的方式有多种，例如，可以从因果图中提取，将因果图中的每个节点获取并得到实体的集合；或者，还可以从解析执行审计日志的时候获取，解析日志的时候会得到多个实体，这时候可以一并获取并组成实体的集合。Exemplarily, before inputting into the model for processing, the input data of the model needs to be clarified. In the embodiment of the present application, the input data of the model is a feature vector obtained by sequence transformation, and constructing a sequence first requires extracting a set of entities. There are many ways to extract a set of entities. For example, it can be extracted from a causal graph, and each node in the causal graph is obtained to obtain a set of entities; or, it can be obtained when parsing the execution audit log. When parsing the log, multiple entities will be obtained, and at this time, they can be obtained together and form a set of entities.

Exemplarily, the set of entities obtained above is the overall set, and it needs to be divided into multiple entity subsets. Each entity subset is a subset of the overall entity set and may therefore include one or more entities. After the entity subsets are obtained, each one is converted into a corresponding entity sequence according to the causal relationships between the entities it contains. This conversion helps in understanding and analyzing how an attack develops and how its events are connected. For example, by taking the causal relationships between entities into account, related events can be linked in their order of occurrence to form an execution path over time, which helps to restore the attacker's trajectory, to analyze how the attack gradually develops and evolves, and to capture the dependencies between attack events.

For example, converting each entity subset into its entity sequence according to the causal relationships between the entities it contains may consist of linking the suspicious attack events in timestamp order. FIG. 4 is a schematic diagram of extracting an entity sequence from a causal graph provided in an embodiment of the present application. The graph contains normal nodes P1, P2 and P6 and ATT&CK entities P3 and P5, with the following events, where T1 to T8 denote successive moments: at T1, P5 reads P3; at T2, P5 writes P1; at T3, P1 reads P4; at T4, P5 reads P4; at T5, P1 reads P2; at T6, P3 reads P2; at T7, P4 reads P2; at T8, P2 reads P6. Given an entity subset {P3, P5} that represents ATT&CK behavior, the suspicious attack events of the attack entities P3 and P5 can be linked in timestamp order and converted into a sequence, which can be labeled as an ATT&CK sequence during training.
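The FIG. 4 example can be sketched in code as follows. The description does not say exactly which events belong to a subset's sequence, so this sketch assumes that an event is included when its subject or object is in the subset, and that each event contributes the tokens subject, action, object; both choices, and the name `extract_sequence`, are assumptions for illustration only.

```python
# Events from FIG. 4 as (subject, action, object, time step).
FIG4_EVENTS = [
    ("P5", "read", "P3", 1),
    ("P5", "write", "P1", 2),
    ("P1", "read", "P4", 3),
    ("P5", "read", "P4", 4),
    ("P1", "read", "P2", 5),
    ("P3", "read", "P2", 6),
    ("P4", "read", "P2", 7),
    ("P2", "read", "P6", 8),
]


def extract_sequence(events, subset):
    """Flatten the events that touch `subset` into a time-ordered token list.

    Assumption: an event is related to the subset if either endpoint
    belongs to it.
    """
    related = [e for e in events if e[0] in subset or e[2] in subset]
    related.sort(key=lambda e: e[3])  # order by timestamp
    sequence = []
    for subject, action, obj, _ in related:
        sequence.extend([subject, action, obj])
    return sequence


attack_sequence = extract_sequence(FIG4_EVENTS, {"P3", "P5"})
```

Under these assumptions the subset {P3, P5} selects the events at T1, T2, T4 and T6, in that order.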

Exemplarily, events are converted into sequences so that sequence modeling and natural language processing techniques can make better use of the temporal order and context of the events, improving both the analysis of ATT&CK techniques and tactics and the effect of model training. Specifically, events in the execution audit log are usually recorded in chronological order. Converting them into sequence form preserves their temporal relationships, which is important for analyzing attacker behavior and identifying the different stages of the attack chain. The sequence form also takes the context of events into account: adjacent events may be dependent on or semantically related to one another, and sequence modeling captures these relationships and improves the performance of the downstream model. Once the events in the audit log have been converted into sequences, natural language processing techniques can be applied for further analysis. For example, word embedding and Transformer models are methods for processing sequence data and are widely used on natural language text, so after the sequences are extracted, techniques such as lemmatization and word embedding can convert the events into computable vector representations for subsequent machine learning or deep learning model training.

需要说明的是，本申请实施例中在构建实体子集，实体子集中实体的顺序按照因果关系排序，从而方便后续根据所包含的实体之间的因果关系将各个实体子集转换为对应的各个实体序列。It should be noted that in the embodiment of the present application, when constructing entity subsets, the order of entities in the entity subsets is sorted according to causal relationships, so as to facilitate the subsequent conversion of each entity subset into corresponding entity sequences according to the causal relationships between the included entities.

步骤S104，将各个实体序列进行特征转换，得到对应的各个序列特征向量，并将各个序列特征向量输入到预先训练好的攻击识别模型中，预测得到各个实体序列的预测结果；Step S104, performing feature conversion on each entity sequence to obtain corresponding feature vectors of each sequence, and inputting each sequence feature vector into a pre-trained attack recognition model to predict each entity sequence to obtain a prediction result;

Exemplarily, after the entity sequences are obtained, each one is first subjected to feature conversion to obtain the corresponding sequence feature vector, and each sequence feature vector is then input into the pre-trained attack recognition model for processing. The attack recognition model is the pre-trained Transformer model of the embodiments of the present application and is used to find ATT&CK behavior in the audit log. After a sequence feature vector is input, it is processed by the model's encoder and decoder, and the model outputs a prediction result for that vector. The prediction result indicates whether the sequence feature vector belongs to an ATT&CK sequence, that is, whether the corresponding entity sequence is an ATT&CK sequence.

Exemplarily, the embodiments of the present application use word embedding to implement feature conversion. Word embedding maps words into a low-dimensional real-valued vector space: by learning the distribution of words and the relationships between them in that space, it converts each word into a continuous real-valued vector. With word embedding, the embodiments of the present application can represent words as dense vectors that carry semantic information, so that semantic and contextual relationships are better captured in natural language processing tasks. Depending on actual needs, the embodiments of the present application may instead use one-hot encoding, a bag-of-words model, or other methods for feature conversion.
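The lookup step of word embedding can be sketched as follows. This is only an illustration of mapping sequence tokens to dense vectors: the embedding table here is randomly initialized as a placeholder, whereas in practice it would be learned during training, and the names `build_vocab`, `init_embeddings`, `embed` and the `<unk>` token are hypothetical.

```python
import random


def build_vocab(sequences):
    """Assign an integer id to every token; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """Placeholder embedding table with small random values.

    A trained model would learn these values; random initialization is
    used here only so the sketch is self-contained.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(sequence, vocab, table):
    """Map each token to its dense vector (unknown tokens map to row 0)."""
    return [table[vocab.get(token, 0)] for token in sequence]


vocab = build_vocab([["P5", "read", "P3"]])
table = init_embeddings(len(vocab), dim=4)
vectors = embed(["P5", "read", "P9"], vocab, table)  # "P9" is unseen
```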

需要说明的是，本申请实施例中的实体序列是根据所包含的实体之间的因果关系将各个实体子集转换得到的，因此模型在识别过程中，可以根据节点或者说是实体之间的因果关系来识别是否为ATT&CK序列，从而提高了模型识别的准确率。It should be noted that the entity sequence in the embodiment of the present application is obtained by converting each entity subset according to the causal relationship between the entities contained therein. Therefore, during the recognition process, the model can identify whether it is an ATT&CK sequence based on the causal relationship between the nodes or entities, thereby improving the accuracy of model recognition.

步骤S105，根据预测结果确定标记为ATT&CK行为的实体序列为目标序列，确定目标序列中的实体为ATT&CK实体，并在因果图中确定ATT&CK实体所在的ATT&CK节点之间的目标路径，根据目标路径和ATT&CK节点将因果图转换为重构后的攻击溯源图。Step S105: Determine the entity sequence marked as ATT&CK behavior as the target sequence based on the prediction results, determine the entity in the target sequence as the ATT&CK entity, and determine the target path between the ATT&CK nodes where the ATT&CK entity is located in the causal graph, and convert the causal graph into a reconstructed attack tracing graph based on the target path and the ATT&CK node.

Exemplarily, the prediction results give the predicted status of each entity sequence. After model prediction, some entity sequences are normal sequences and others are marked as ATT&CK behavior. The entity sequences marked as ATT&CK behavior are defined as target sequences, and all entities in a target sequence are ATT&CK entities, that is, attack entities. The provenance graph can then be generated from the ATT&CK entities so determined.
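The selection of target sequences and ATT&CK entities can be sketched as follows, assuming the model's output has already been reduced to one label per entity sequence. The label strings and the function name `collect_attack_entities` are illustrative assumptions, not the application's actual output format.

```python
def collect_attack_entities(sequences, predictions, attack_label="ATT&CK"):
    """Return (target_sequences, attack_entities).

    sequences:   list of entity sequences (lists of entity names)
    predictions: one predicted label per sequence (assumed format)
    """
    targets = [
        seq for seq, label in zip(sequences, predictions)
        if label == attack_label
    ]
    attack_entities = set()
    for seq in targets:
        attack_entities.update(seq)  # every entity in a target sequence
    return targets, attack_entities


sequences = [["P3", "P5"], ["P1", "P2"]]
predictions = ["ATT&CK", "normal"]
targets, attack_entities = collect_attack_entities(sequences, predictions)
```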

示例性的，溯源图是一种抽象表达能力强的工具，效率相对较高。因此现在越来越多的研究工作开始关注基于溯源图的检测和响应算法，并相信溯源图有潜力成为下一代更强大的检测机制。溯源图是所有主体、客体和事件的集合，可以用G=(S，O，E)来表示，其中S代表主体的集合，O代表客体的集合，E代表事件的集合。这些操作由审计工具收集，并生成带有时间戳的事件流。事件的顺序影响语义，事件是定向的，指示数据流或控制流。因此，溯源图具有很强的时空特性。这种性质被称为溯源图的因果性。在溯源图中，主体和客体都被表示为节点，而事件被表示为边。具有不同时间或操作的两个节点之间可以有多条边。For example, the provenance graph is a tool with strong abstract expression ability and relatively high efficiency. Therefore, more and more research works are now focusing on detection and response algorithms based on the provenance graph, and believe that the provenance graph has the potential to become a more powerful detection mechanism for the next generation. The provenance graph is a set of all subjects, objects, and events, which can be represented by G=(S, O, E), where S represents the set of subjects, O represents the set of objects, and E represents the set of events. These operations are collected by the audit tool and a stream of events with timestamps is generated. The order of events affects the semantics, and events are directional, indicating data flow or control flow. Therefore, the provenance graph has strong spatiotemporal characteristics. This property is called the causality of the provenance graph. In the provenance graph, both subjects and objects are represented as nodes, and events are represented as edges. There can be multiple edges between two nodes with different times or operations.

Exemplarily, the attack tracing graph is the provenance graph finally generated in the embodiments of the present application, and it is obtained by converting the previously constructed causal graph. Specifically, once the ATT&CK entities are determined, the nodes of the causal graph where those entities are located are the ATT&CK nodes. In ATT&CK techniques and tactics, an APT attack usually involves two related attack entities, the APT organization and the APT tool, so there are generally two ATT&CK nodes. The present application needs to determine the target path between the ATT&CK nodes. Every pair of connected nodes in the causal graph has a corresponding path, and the target path may be a relatively short path from one ATT&CK node to the other. After the target path is determined, the normal nodes and edges that do not lie on it can be deleted, which makes the causal graph more compact. The embodiments of the present application take the causal graph converted in this way as the reconstructed attack tracing graph.
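The path search and pruning can be sketched as follows. The description only says that the target path may be a relatively short path between the ATT&CK nodes, so this sketch assumes a breadth-first search along directed edges; the edge list reuses the FIG. 4 example, and the function names are hypothetical.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search along directed edges; None if unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def prune_to_path(edges, path):
    """Keep only the edges that lie on the target path."""
    on_path = set(zip(path, path[1:]))
    return [edge for edge in edges if edge in on_path]


# Directed edges of the FIG. 4 causal graph (subject -> object).
FIG4_EDGES = [
    ("P5", "P3"), ("P5", "P1"), ("P1", "P4"), ("P5", "P4"),
    ("P1", "P2"), ("P3", "P2"), ("P4", "P2"), ("P2", "P6"),
]
target_path = shortest_path(FIG4_EDGES, "P5", "P3")
tracing_edges = prune_to_path(FIG4_EDGES, target_path)
```

With ATT&CK nodes P5 and P3, the search returns the direct edge between them, and every other node and edge of the example graph is dropped.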

The embodiments of the present application construct the attack provenance graph from the execution audit log of ATT&CK techniques and tactics. First, a causal graph is constructed with the entities extracted from the execution audit log as nodes and the events as edges. This causal graph contains a large number of normal nodes, and it is not yet known which nodes are ATT&CK nodes, so it cannot serve as the final provenance graph and needs further processing. The embodiments therefore extract the set of entities, divide it into multiple entity subsets, and convert each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains, so that the sequences carry different nodes for processing by the model. Each entity sequence is then feature-converted and input into the pre-trained attack recognition model, which predicts a result for each entity sequence. In this way the model determines whether the entities in a sequence are ATT&CK entities. The nodes corresponding to ATT&CK entities are the ATT&CK nodes, the target path between the ATT&CK nodes can be determined in the causal graph, and the causal graph is finally converted into the reconstructed attack provenance graph according to the target path and the ATT&CK nodes.

The embodiments of the present application only need the execution audit log of ATT&CK techniques and tactics to construct the attack provenance graph, without additional workload. By placing the entities in entity sequences and using the attack recognition model to determine the attack sequences, the ATT&CK entities and nodes can be determined accurately and the attack provenance graph can be constructed. The embodiments of the present application can therefore improve the efficiency and quality of network attack reconstruction.

需要说明的是，本申请实施例中数据来源于通用攻击知识库Mitre ATT&CK矩阵，该矩阵几乎覆盖了目前所有的攻击手法，使用该矩阵作为攻击行为的代表一定程度上也能反映真实的攻击行为，所使用ATT&CK数据则更新频率高，适配于当前真实的攻防环境。It should be noted that the data in the embodiments of the present application are derived from the Mitre ATT&CK matrix, a general attack knowledge base. The matrix covers almost all current attack methods. Using the matrix as a representative of attack behavior can also reflect the real attack behavior to a certain extent. The ATT&CK data used is updated frequently and is adapted to the current real attack and defense environment.

请参阅图5，在一些实施例中，步骤S103可以包括步骤S201至步骤S202：Please refer to FIG. 5 . In some embodiments, step S103 may include steps S201 to S202 :

步骤S201，从因果图中提取实体的集合，其中，实体的集合中包含有因果图中的所有实体；Step S201, extracting a set of entities from the causal graph, wherein the set of entities includes all entities in the causal graph;

步骤S202，从实体的集合中选取任意至少两个实体，并将选取到的任意至少两个实体组成实体子集，得到多个实体子集。Step S202: select any at least two entities from the set of entities, and form an entity subset with the selected at least two entities to obtain a plurality of entity subsets.

Exemplarily, the set of entities in the embodiments of the present application is extracted from the causal graph and contains all entities in the causal graph. For example, the causal graph in FIG. 4 has six nodes, P1, P2, P3, P4, P5 and P6, so the set of entities extracted from it is {P1, P2, P3, P4, P5, P6}.

示例性的，本申请实施例中所构建的实体子集，将至少包括两个实体，在划分得到各个实体子集时，需要从实体的集合中选取任意至少两个实体，并将选取到的任意至少两个实体组成实体子集，得到多个实体子集。例如，再以上述图4为例子，在根据实体的集合{P1，P2，P3，P4，P5，P6}划分出来的实体子集中，可以有{P1，P2}、{P1，P3}、{P1，P2，P4}、{P1，P2，P3，P4，P5}等等，这些实体子集中最少都包括两个实体。Exemplarily, the entity subsets constructed in the embodiments of the present application will include at least two entities. When dividing each entity subset, it is necessary to select any at least two entities from the set of entities, and form any at least two selected entities into an entity subset to obtain multiple entity subsets. For example, taking the above-mentioned Figure 4 as an example, in the entity subsets divided according to the entity set {P1, P2, P3, P4, P5, P6}, there may be {P1, P2}, {P1, P3}, {P1, P2, P4}, {P1, P2, P3, P4, P5}, etc., and these entity subsets include at least two entities.
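Step S202 can be sketched with a simple enumeration of all subsets that hold at least two entities. The function name `entity_subsets` and the `max_size` cap are assumptions added for illustration, since the text does not say how, or whether, the number of subsets is limited.

```python
from itertools import combinations


def entity_subsets(entities, min_size=2, max_size=None):
    """Yield every subset of `entities` holding between min_size and max_size members."""
    max_size = len(entities) if max_size is None else max_size
    for size in range(min_size, max_size + 1):
        yield from combinations(entities, size)


entities = ["P1", "P2", "P3", "P4", "P5", "P6"]
subsets = list(entity_subsets(entities))
```

For six entities this produces 57 subsets. The count grows exponentially with the number of entities, so a practical implementation would probably need a cap such as `max_size` or some other selection rule.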

It should be noted that, in ATT&CK techniques and tactics, an APT attack usually involves two associated attack entities, the APT organization and the APT tool, so there are generally at least two ATT&CK nodes, or ATT&CK entities. Generating entity sequences for model input from entity subsets that contain at least two entities can therefore further improve the ability to recognize ATT&CK behavior. For example, if the ATT&CK sequence is {P3, P5}, only the feature vector of the {P3, P5} sequence yields a prediction result of ATT&CK behavior after being processed by the model; sequences that mix an ATT&CK node with normal nodes, such as {P3, P6} or {P2, P5}, are not identified as ATT&CK behavior. Furthermore, ATT&CK is an umbrella term that generally covers more than 100 techniques and tactics, so the number of ATT&CK entities and normal entities involved is large and, in practice, there are more than two ATT&CK entities. Having each entity subset contain at least two entities thus further improves the efficiency and quality of identifying ATT&CK sequences and better distinguishes normal behavior from ATT&CK behavior.

Exemplarily, the attack recognition model is a Transformer model and therefore has an encoder and a decoder; FIG. 6 shows a schematic diagram of the attack recognition model structure provided by an embodiment of the present application. The encoder has two sub-layers, a multi-head attention layer (Multi-Head Attention) and a feed-forward layer (Feed Forward), each followed by a normalization layer. The decoder has several sub-layers, including a masked multi-head attention layer (Masked Multi-Head Attention), a multi-head attention layer, a feed-forward layer and a linear layer, and the sub-layers are joined by a residual connection and normalization layer or by a normalization layer, as shown in the figure. The decoder is finally connected to a classification output layer consisting of a softmax layer.

请参阅图7，在一些实施例中，步骤S104可以包括步骤S301至步骤S303：Please refer to FIG. 7 . In some embodiments, step S104 may include steps S301 to S303 :

步骤S301，获取各个序列特征向量在序列中的位置信息以及攻击识别模型的模型维度，并为各个序列特征向量建立维度索引，通过位置信息、模型维度和维度索引计算各个序列特征向量的位置编码；Step S301, obtaining the position information of each sequence feature vector in the sequence and the model dimension of the attack recognition model, and establishing a dimension index for each sequence feature vector, and calculating the position code of each sequence feature vector through the position information, model dimension and dimension index;

步骤S302，将各个序列特征向量嵌入对应的位置编码后，输入到模型的编码器中，经过编码器中多头注意力层和前馈层的处理后，得到各个序列特征向量对应的编码特征向量；Step S302, after embedding each sequence feature vector into the corresponding position code, input it into the encoder of the model, and after being processed by the multi-head attention layer and the feedforward layer in the encoder, obtain the encoding feature vector corresponding to each sequence feature vector;

步骤S303，将各个编码特征向量嵌入对应的位置编码后，输入到模型的解码器中，经过解码器中掩盖的多头注意力层、多头注意力层、残差连接和层归一化层、前馈层、线性层和分类输出层的处理后，预测得到各个实体序列的预测结果。Step S303, after embedding each encoded feature vector into the corresponding position code, input it into the decoder of the model, and after being processed by the masked multi-head attention layer, multi-head attention layer, residual connection and layer normalization layer, feedforward layer, linear layer and classification output layer in the decoder, the prediction results of each entity sequence are obtained.

示例性的，攻击识别模型的输入部分有两个输入，一个是从编码器输入，一个是从解码器输入。但是两次输入的数据不同，实际解码器的输入是编码器输出的样本，而输入编码器的是经嵌入的序列样本，并且两次输入进入编码器和解码器前还需嵌入位置编码。For example, the input part of the attack recognition model has two inputs, one from the encoder and the other from the decoder. However, the data of the two inputs are different. The actual decoder input is the sample output by the encoder, while the input to the encoder is the embedded sequence sample, and the two inputs need to be embedded in the position code before entering the encoder and decoder.

示例性的，输入进入编码器和解码器前还需要嵌入位置编码，获取各个序列特征向量在序列中的位置信息以及攻击识别模型的模型维度，并为各个序列特征向量建立维度索引，通过位置信息、模型维度和维度索引计算各个序列特征向量的位置编码。需要嵌入位置编码是因为Transformer没有对输入模型的数据位置信息进行处理，也不像长短期记忆神经网络一样由串行的方式运行，可以得到位置信息，所以需要将位置信息嵌入进输入数据中。For example, before the input enters the encoder and decoder, it is necessary to embed the position code, obtain the position information of each sequence feature vector in the sequence and the model dimension of the attack recognition model, and establish a dimension index for each sequence feature vector. The position code of each sequence feature vector is calculated through the position information, model dimension and dimension index. The position code needs to be embedded because the Transformer does not process the data position information of the input model, and does not run in a serial manner like the long short-term memory neural network to obtain the position information, so the position information needs to be embedded in the input data.

The position encoding is calculated as follows:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position information in the sequence, $i$ is the dimension, that is, the dimension index established for each sequence feature vector, $d_{model}$ is the model dimension, $2i$ denotes an even index, corresponding to the even dimensions of the position encoding vector PE, and $2i+1$ denotes an odd index, corresponding to the odd dimensions of PE. In this way the different dimensions can be distinguished when the position encoding is calculated and applied to the corresponding input data.

通过上述的公式可以得到的位置编码维度与输入数据相同，将输入向量与位置编码向量相加，即嵌入了位置编码，并得到了序列的位置信息，后将完成位置编码嵌入的数据向量输入进编码器和解码器中。因此，在得到位置编码后，将各个序列特征向量嵌入对应的位置编码后，输入到模型的编码器中，经过编码器中多头注意力层和前馈层的处理后，得到各个序列特征向量对应的编码特征向量，再将各个编码特征向量嵌入对应的位置编码后，输入到模型的解码器中，经过解码器中掩盖的多头注意力层、多头注意力层、残差连接和层归一化层、前馈层、线性层和分类输出层的处理后，预测得到各个实体序列的预测结果。The position encoding dimension obtained by the above formula is the same as the input data. The position encoding is embedded by adding the input vector to the position encoding vector, and the position information of the sequence is obtained. Then, the data vector with the position encoding embedded is input into the encoder and decoder. Therefore, after obtaining the position encoding, each sequence feature vector is embedded in the corresponding position encoding and input into the encoder of the model. After being processed by the multi-head attention layer and the feedforward layer in the encoder, the encoding feature vector corresponding to each sequence feature vector is obtained. After each encoding feature vector is embedded in the corresponding position encoding, it is input into the decoder of the model. After being processed by the masked multi-head attention layer, multi-head attention layer, residual connection and layer normalization layer, feedforward layer, linear layer and classification output layer in the decoder, the prediction results of each entity sequence are obtained.
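The position encoding and its addition to the input vectors can be sketched as below. The sketch assumes the standard sinusoidal Transformer encoding, which matches the symbols defined above (position, dimension index, model dimension, even and odd indices); the function names are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe


def add_position(vectors):
    """Embed positions by adding the encoding to each input vector element-wise."""
    pe = positional_encoding(len(vectors), len(vectors[0]))
    return [[v + p for v, p in zip(vec, row)] for vec, row in zip(vectors, pe)]


pe = positional_encoding(seq_len=4, d_model=8)
embedded = add_position([[0.0] * 8 for _ in range(4)])
```

Because the encoding has the same dimension as the input, the addition is element-wise and leaves the shape of the data unchanged.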

请参阅图8，在一些实施例中，攻击识别模型通过以下步骤训练得到，可以包括步骤S401至步骤S405：Please refer to FIG. 8 . In some embodiments, the attack identification model is trained by the following steps, which may include steps S401 to S405:

步骤S401，获取ATT&CK技战术的样本审计日志；Step S401, obtaining a sample audit log of ATT&CK techniques and tactics;

步骤S402，解析样本审计日志，从样本审计日志中提取多个样本实体和对应的样本事件，并以样本实体作为节点、以样本事件表征的实体之间的因果关系作为边构建样本因果图；Step S402, parsing the sample audit log, extracting multiple sample entities and corresponding sample events from the sample audit log, and constructing a sample causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges;

步骤S403，从样本因果图中提取样本实体的集合，从样本实体的集合中确定多个样本ATT&CK实体子集，并根据所包含的实体之间的因果关系将各个样本ATT&CK实体子集转换为对应的各个样本实体序列，其中，样本ATT&CK实体子集中至少包括一个样本ATT&CK实体；Step S403, extracting a set of sample entities from the sample causal graph, determining multiple sample ATT&CK entity subsets from the set of sample entities, and converting each sample ATT&CK entity subset into a corresponding sequence of sample entities according to the causal relationship between the included entities, wherein the sample ATT&CK entity subset includes at least one sample ATT&CK entity;

步骤S404，将各个样本实体序列进行特征转换，得到对应的各个样本序列特征向量，并依次将各个样本序列特征向量输入到攻击识别模型中，预测得到各个样本实体序列的样本预测结果；Step S404, performing feature conversion on each sample entity sequence to obtain corresponding feature vectors of each sample sequence, and sequentially inputting the feature vectors of each sample sequence into the attack recognition model to predict and obtain sample prediction results of each sample entity sequence;

步骤S405，根据所包含的样本ATT&CK实体的情况为样本实体序列打上样本标签，并根据样本标签和样本预测结果调整攻击识别模型的参数，得到训练后的攻击识别模型。Step S405: label the sample entity sequence according to the sample ATT&CK entities included therein, and adjust the parameters of the attack recognition model according to the sample labels and the sample prediction results to obtain a trained attack recognition model.

示例性的，ATT&CK技战术的样本审计日志是训练过中获取到的日志，与上述实施例中的执行审计日志类似，仅代表在模型训练阶段，在此不再赘述。Exemplarily, the sample audit log of ATT&CK techniques and tactics is a log obtained during the training process, which is similar to the execution audit log in the above embodiment and only represents the model training stage, which will not be elaborated here.

示例性的，在获取样本审计日志后，需要对日志进行解析，解析之后，可以得到日志中的主体和客体，并将主体和客体作为样本实体，例如，样本实体可以是进程，文件，网络连接等，并在形成样本因果图的时候，将样本实体作为因果图中的节点，而在样本审计日志中，样本事件代表了系统、网络或应用程序中发生的特定操作或活动，例如，用户登录、文件访问、进程启动等都可以被视为样本事件，通过分析这些样本事件，可以获取关于攻击者行为、异常活动或潜在安全威胁的信息，并在形成因果图的时候，以事件表征的实体之间的样本因果关系作为边来构建样本因果图。Exemplarily, after obtaining the sample audit log, the log needs to be parsed. After parsing, the subject and object in the log can be obtained, and the subject and object are used as sample entities. For example, the sample entity can be a process, a file, a network connection, etc., and when forming a sample causal graph, the sample entity is used as a node in the causal graph. In the sample audit log, the sample event represents a specific operation or activity that occurs in the system, network or application. For example, user login, file access, process startup, etc. can all be regarded as sample events. By analyzing these sample events, information about attacker behavior, abnormal activities or potential security threats can be obtained, and when forming a causal graph, the sample causal relationship between entities represented by the event is used as an edge to construct the sample causal graph.

示例性的，在输入到模型进行处理之前，需要明确模型的输入数据，本申请实施例中模型的输入数据是根据序列转化得到的特征向量，而构建序列首先需要提取样本实体的集合，可以从样本因果图中提取，将样本因果图中的每个节点获取并得到样本实体的集合。Exemplarily, before inputting into the model for processing, it is necessary to clarify the input data of the model. In the embodiment of the present application, the input data of the model is a feature vector obtained based on sequence transformation, and constructing the sequence first requires extracting a set of sample entities, which can be extracted from the sample causal graph, and each node in the sample causal graph is obtained to obtain a set of sample entities.

示例性的，在得到样本实体的集合后，这是一个总的集合，需要将其划分得到多个子集，与应用过程不同的是，训练过程中建立的子集将包含有ATT&CK实体。具体的，本申请实施例中从样本实体的集合中确定多个样本ATT&CK实体子集，并根据所包含的实体之间的因果关系将各个样本ATT&CK实体子集转换为对应的各个样本实体序列，其中，样本ATT&CK实体子集中至少包括一个样本ATT&CK实体，也就是说，样本ATT&CK实体子集中要么全都是样本ATT&CK实体，要么至少含有一个样本ATT&CK实体。Exemplarily, after obtaining a set of sample entities, this is a total set, which needs to be divided into multiple subsets. Unlike the application process, the subsets established during the training process will contain ATT&CK entities. Specifically, in an embodiment of the present application, multiple sample ATT&CK entity subsets are determined from the set of sample entities, and each sample ATT&CK entity subset is converted into a corresponding sequence of sample entities based on the causal relationship between the included entities, wherein the sample ATT&CK entity subset includes at least one sample ATT&CK entity, that is, the sample ATT&CK entity subset is either all sample ATT&CK entities or contains at least one sample ATT&CK entity.

需要说明的是，训练过程可以对样本因果图进行标记，对其中的样本ATT&CK节点进行标记，标记后所提取到的样本实体就是样本ATT&CK实体，通过标记攻击行为的方式有助于后续进行训练。It should be noted that the training process can mark the sample causal graph and the sample ATT&CK nodes therein. The sample entities extracted after marking are the sample ATT&CK entities. Marking attack behaviors will help with subsequent training.

示例性的，根据所包含的实体之间的因果关系将各个实体子集转换为对应的各个实体序列，而根据实体之间的因果关系将样本ATT&CK实体子集转换为对应的样本实体序列，可以帮助我们更好地理解和分析攻击事件的发展过程以及事件之间的联系，其转换过程与上述应用过程中转换得到实体序列的过程相似，在此不再赘述。Exemplarily, each entity subset is converted into a corresponding entity sequence according to the causal relationship between the entities contained therein, and the sample ATT&CK entity subset is converted into a corresponding sample entity sequence according to the causal relationship between the entities. This can help us better understand and analyze the development process of the attack incident and the connection between the incidents. The conversion process is similar to the process of converting the entity sequence in the above application process, and will not be repeated here.

示例性的，在得到各个样本实体序列后，需要先将各个样本实体序列进行特征转换，得到对应的各个样本序列特征向量，随后将各个样本序列特征向量输入到攻击识别模型中，以便攻击识别模型进行相应的处理。当样本序列特征向量输入到模型中后，可以通过其编码器和解码器的处理，并最终输出去样本序列特征向量的样本预测结果，该样本预测结果可以表征该样本序列特征向量是否为ATT&CK序列的向量，也就是说，样本预测结果可以表征各个样本实体序列是否为ATT&CK序列。Exemplarily, after obtaining each sample entity sequence, it is necessary to first perform feature conversion on each sample entity sequence to obtain the corresponding feature vectors of each sample sequence, and then input each sample sequence feature vector into the attack recognition model so that the attack recognition model performs corresponding processing. After the sample sequence feature vector is input into the model, it can be processed by its encoder and decoder, and finally output the sample prediction result of the sample sequence feature vector. The sample prediction result can characterize whether the sample sequence feature vector is a vector of the ATT&CK sequence, that is, the sample prediction result can characterize whether each sample entity sequence is an ATT&CK sequence.

需要说明的是，本申请实施例的样本ATT&CK实体子集中至少包括一个样本ATT&CK实体，这样在训练的时候，模型可以更好地学习到异常的样本，以便在应用的实际，可以更好确定输入的实体序列是否为ATT&CK序列。It should be noted that the sample ATT&CK entity subset in the embodiment of the present application includes at least one sample ATT&CK entity, so that during training, the model can better learn abnormal samples, so that in actual application, it can better determine whether the input entity sequence is an ATT&CK sequence.

示例性的，训练过程需要根据预测结果来计算损失值，并调整攻击识别模型的参数。具体的，本申请实施例根据所包含的样本ATT&CK实体的情况为样本实体序列打上样本标签，也就是说，样本实体序列在转换为特征向量输入到模型之前，可以通过标签知道其是否表征为ATT&CK行为，随后，根据样本标签和样本预测结果是否相同，可以计算得到模型的损失值，并根据损失值调整攻击识别模型的参数，得到训练后的攻击识别模型。Exemplarily, the training process needs to calculate the loss value based on the prediction results and adjust the parameters of the attack recognition model. Specifically, the embodiment of the present application labels the sample entity sequence according to the sample ATT&CK entities included. That is, before the sample entity sequence is converted into a feature vector and input into the model, it can be known through the label whether it is characterized as an ATT&CK behavior. Subsequently, based on whether the sample label and the sample prediction result are the same, the loss value of the model can be calculated, and the parameters of the attack recognition model can be adjusted according to the loss value to obtain the trained attack recognition model.

It should be noted that the embodiments of the present application use the Adam (adaptive moment estimation) optimizer for training, because gradient descent with Adam converges relatively quickly. After each round of training, it is determined whether the loss is greater than that of the previous round, and if the loss is smaller than the previous one, the training is completed. The embodiments of the present application do not specifically limit the training process.

请参阅图9，在一些实施例中，步骤S403可以包括步骤S501至步骤S503：Please refer to FIG. 9 . In some embodiments, step S403 may include steps S501 to S503:

步骤S501，从样本实体的集合中确定标记为ATT&CK行为的至少一个样本ATT&CK实体，并根据至少一个样本ATT&CK实体构建至少一个攻击子集，其中，每个攻击子集中的实体均为样本ATT&CK实体，并且攻击子集中至少含有一个样本ATT&CK实体；Step S501, determining at least one sample ATT&CK entity marked as an ATT&CK behavior from the set of sample entities, and constructing at least one attack subset based on the at least one sample ATT&CK entity, wherein the entities in each attack subset are sample ATT&CK entities, and the attack subset contains at least one sample ATT&CK entity;

步骤S502，从样本实体的集合中确定标记为正常行为的至少一个样本正常实体，并为每个攻击子集添加任意至少一个所样本正常实体，形成至少一个正常子集；Step S502, determining at least one sample normal entity marked as normal behavior from the set of sample entities, and adding any at least one sample normal entity to each attack subset to form at least one normal subset;

步骤S503，将至少一个攻击子集和正常子集作为样本ATT&CK实体子集。Step S503: taking at least one attack subset and a normal subset as sample ATT&CK entity subsets.

示例性的，样本ATT&CK实体子集中至少包括一个样本ATT&CK实体，也就是说，样本ATT&CK实体子集中要么全都是样本ATT&CK实体，要么至少含有一个样本ATT&CK实体。具体的，本申请实施例可以从样本实体的集合中确定标记为ATT&CK行为的至少一个样本ATT&CK实体，这时候确定的样本ATT&CK实体可以有多个。随后根据至少一个样本ATT&CK实体构建至少一个攻击子集，这里有多种情况，例如，若样本ATT&CK实体有一个，那么仅构建一个攻击子集，若样本ATT&CK实体有两个，那么仅构建一个攻击子集，或者形成两个攻击子集，若样本ATT&CK实体超过两个，那么所构建的攻击子集就有多个。Exemplarily, the sample ATT&CK entity subset includes at least one sample ATT&CK entity, that is, the sample ATT&CK entity subset is either all sample ATT&CK entities or contains at least one sample ATT&CK entity. Specifically, the embodiment of the present application can determine at least one sample ATT&CK entity marked as ATT&CK behavior from the set of sample entities. At this time, there can be multiple sample ATT&CK entities determined. Then, at least one attack subset is constructed based on at least one sample ATT&CK entity. There are multiple situations here. For example, if there is one sample ATT&CK entity, only one attack subset is constructed. If there are two sample ATT&CK entities, only one attack subset is constructed, or two attack subsets are formed. If there are more than two sample ATT&CK entities, there are multiple attack subsets constructed.

示例性的，每个攻击子集中的实体均为样本ATT&CK实体，也就是说所构建的攻击子集就是一个ATT&CK实体的集合，并且攻击子集中至少含有一个样本ATT&CK实体。进一步的，由于APT攻击通常包含了两个关联的攻击实体，分别是APT组织和APT工具，因此，一般ATT&CK节点或者说ATT&CK实体最少有两个，所以本申请实施例中所构建的攻击子集，至少包括两个样本ATT&CK实体。Exemplarily, the entities in each attack subset are sample ATT&CK entities, that is, the constructed attack subset is a collection of ATT&CK entities, and the attack subset contains at least one sample ATT&CK entity. Furthermore, since an APT attack usually includes two related attack entities, namely an APT organization and an APT tool, there are generally at least two ATT&CK nodes or ATT&CK entities, so the attack subset constructed in the embodiment of the present application includes at least two sample ATT&CK entities.

示例性的，在构建攻击子集后，模型通过攻击子集学习到序列中的ATT&CK行为，然而，在模型的应用过程中，节点数量较多，这导致正常节点的数量较多，因此后续建立的实体序列中正常的序列会非常多，数量远大于ATT&CK序列。因此，本申请实施例中需要模型可以学习正常行为，能够区分恶意和非恶意活动之间的边界，所以本申请实施例中需要形成一些正常的子集，以便模型训练所用。For example, after constructing the attack subset, the model learns the ATT&CK behavior in the sequence through the attack subset. However, in the application process of the model, the number of nodes is large, which leads to a large number of normal nodes. Therefore, there will be a lot of normal sequences in the entity sequence established subsequently, which is much larger than the ATT&CK sequence. Therefore, in the embodiment of the present application, it is necessary for the model to learn normal behavior and distinguish the boundary between malicious and non-malicious activities. Therefore, in the embodiment of the present application, it is necessary to form some normal subsets for model training.

示例性的，本申请实施例可以从样本实体的集合中确定标记为正常行为的至少一个样本正常实体，并为每个攻击子集添加任意至少一个所样本正常实体，形成至少一个正常子集，例如，若上述形成的攻击子集包括{P1，P2}、{P2，P3}，实体P1、P2和P3为样本ATT&CK实体，而P4为样本正常实体，因此，可以形成的正常子集包括{P1，P2，P4}、{P2，P3，P4}。Exemplarily, an embodiment of the present application can determine at least one sample normal entity marked as normal behavior from a set of sample entities, and add any at least one sample normal entity to each attack subset to form at least one normal subset. For example, if the attack subset formed above includes {P1, P2}, {P2, P3}, entities P1, P2 and P3 are sample ATT&CK entities, and P4 is a sample normal entity. Therefore, the normal subset that can be formed includes {P1, P2, P4}, {P2, P3, P4}.

Furthermore, when adding at least one sample normal entity to each attack subset, the attack subset can be split into its individual sample ATT&CK entities, and at least one randomly selected sample ATT&CK entity can be combined with at least one sample normal entity to form at least one normal subset. For example, suppose the sample entities are {P1, P2, P3, P4, P5, P6} and {P1, P2, P3} is the attack sequence. The extracted attack subsets are then {P1, P2}, {P1, P3}, {P2, P3} and {P1, P2, P3}. To extract normal sequences, {P1}, {P2}, {P3}, {P1, P2}, {P1, P3}, {P2, P3} and {P1, P2, P3} are extracted first, and one of the three non-attack entities {P4, P5, P6} is then appended to each of these subsets, giving {P1, P4}, {P2, P4}, {P1, P2, P3, P4} and so on as the normal subsets.
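The worked example above can be reproduced with a short sketch. The function names are hypothetical, and the labels 1 (ATT&CK) and 0 (normal) are an assumed encoding, since the text does not state how the sample labels are represented.

```python
from itertools import combinations


def nonempty_subsets(items, min_size=1):
    """Yield every subset of `items` with at least min_size members."""
    for size in range(min_size, len(items) + 1):
        yield from combinations(items, size)


def build_training_subsets(attack_entities, normal_entities):
    """Return (attack, normal) lists of (subset, label) pairs."""
    # Attack subsets: only ATT&CK entities, at least two of them.
    attack = [(s, 1) for s in nonempty_subsets(attack_entities, min_size=2)]
    # Normal subsets: any non-empty group of ATT&CK entities plus one normal entity.
    normal = [
        (base + (extra,), 0)
        for base in nonempty_subsets(attack_entities)
        for extra in normal_entities
    ]
    return attack, normal


attack, normal = build_training_subsets(["P1", "P2", "P3"], ["P4", "P5", "P6"])
```

With three ATT&CK entities and three normal entities this yields 4 attack subsets and 21 normal subsets, which already shows how the normal class comes to outnumber the attack class.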

示例性的，图10示出了本申请实施例提供的提取到的攻击子集和正常子集的示意图。此时假设图4为训练过程的样本因果图，图10是从图4中的样本因果图中提取到的，在图中，P3和P5为样本ATT&CK节点，其他的节点为样本正常节点，则提取{P5，P3}为一个攻击子集，其中P5对P3执行T1读操作，因此子集中P5在前而P3在后，还可以提取得到{P5，P1}、{P4，P5}和{P3，P2}为正常子集，正常子集中也是按照执行的动作的因果关系排列的，在此不再赘述。Exemplarily, Figure 10 shows a schematic diagram of the extracted attack subset and normal subset provided by an embodiment of the present application. At this time, it is assumed that Figure 4 is a sample causal graph of the training process, and Figure 10 is extracted from the sample causal graph in Figure 4. In the figure, P3 and P5 are sample ATT&CK nodes, and the other nodes are sample normal nodes. Then {P5, P3} is extracted as an attack subset, where P5 performs a T1 read operation on P3, so P5 is in front and P3 is in the back in the subset. It is also possible to extract {P5, P1}, {P4, P5} and {P3, P2} as normal subsets, which are also arranged according to the causal relationship of the executed actions, which will not be repeated here.

示例性的，正常子集中的样本正常实体可以是随机添加的，在训练的时候，可以将每个样本正常实体单个或多个组合后添加形成正常子集，对此本申请实施例不做具体限制。Exemplarily, the sample normal entities in the normal subset may be added randomly. During training, each sample normal entity may be added individually or in combination to form a normal subset. This embodiment of the present application does not impose any specific limitation on this.

Exemplarily, the normal subsets also include fully normal subsets, in which every sample entity is a sample normal entity, whereas the subsets described above, formed by adding sample normal entities to an attack subset, are partially normal subsets. During training, the fully normal subsets can also be fed into the model so that the model learns normal behavior.

请参阅图11，在一些实施例中，步骤S404可以包括步骤S601至步骤S602：Please refer to FIG. 11 , in some embodiments, step S404 may include steps S601 to S602:

步骤S601，若正常子集与攻击子集之间的子集数量比值大于预设阈值，对样本序列特征向量进行过采样处理，以增加攻击子集对应的样本序列特征向量的数量，或者，对样本序列特征向量进行欠采样处理，以减少正常子集对应的样本序列特征向量的数量；Step S601, if the subset number ratio between the normal subset and the attack subset is greater than a preset threshold, oversampling is performed on the sample sequence feature vector to increase the number of sample sequence feature vectors corresponding to the attack subset, or undersampling is performed on the sample sequence feature vector to reduce the number of sample sequence feature vectors corresponding to the normal subset;

步骤S602，依次将过采样或欠采样处理之后的各个样本序列特征向量输入到攻击识别模型中。Step S602: inputting the feature vectors of each sample sequence after oversampling or undersampling processing into the attack recognition model in sequence.

示例性的，图12示出了本申请实施例提供的训练阶段数据处理过程示意图，在训练过程，样本审计日志需要经过处理后构造样本因果图，并提取序列，之后经过词形还原和词嵌入后，在样本数据输入到攻击识别模型进行训练之前，需要进行平衡数据。Exemplarily, Figure 12 shows a schematic diagram of the data processing process in the training phase provided in an embodiment of the present application. During the training process, the sample audit log needs to be processed to construct a sample causal graph and extract the sequence. After word form restoration and word embedding, the sample data needs to be balanced before it is input into the attack identification model for training.

Exemplarily, as described in the above embodiments, the number of sample normal entities is far greater than the number of sample ATT&CK entities, so the number of normal subsets established is also far greater than the number of attack subsets, and the collected ATT&CK samples are far fewer than the normal samples. A model trained on such samples would be biased towards normal samples. The embodiments of the present application therefore need to balance the numbers of normal samples and ATT&CK samples: when the ratio of the number of normal subsets to the number of attack subsets is greater than a preset threshold, the normal samples are considered to far outnumber the ATT&CK samples, and a data balancing operation is performed.

示例性的，本申请实施例中通过采用过采样（Over-sampling）和欠采样（Under-sampling）处理实现数据平衡，过采样和欠采样都是在处理不平衡数据集时常用的方法，目的是解决类别之间样本数量差异较大的问题。Exemplarily, in the embodiments of the present application, data balance is achieved by using over-sampling and under-sampling processing. Over-sampling and under-sampling are both commonly used methods when processing unbalanced data sets, and their purpose is to solve the problem of large differences in the number of samples between categories.

Specifically, for oversampling, the embodiments of the present application may use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the ATT&CK samples. SMOTE synthesizes minority class samples: it interpolates between minority class samples to generate new synthetic samples, thereby expanding the number of minority class samples. Here the ATT&CK samples are the minority class. For each minority class sample, the Euclidean distance to every sample in the minority class sample set is computed to obtain its k nearest neighbors. A sampling ratio N is then set according to the sample imbalance rate, and several samples are randomly selected from the k nearest neighbors of the minority class sample x. The newly constructed sample is $x_{new} = x + \mathrm{rand}(0,1) \times (\tilde{x} - x)$, where $x_{new}$ is the balanced ATT&CK sample, $\tilde{x}$ is the average value of all samples, and $\mathrm{rand}(0,1)$ generates a random number between 0 and 1. The oversampling operation continues until the numbers of minority samples and majority samples are balanced.
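The interpolation described above can be illustrated with a minimal sketch in plain Python. It follows the standard SMOTE formulation, in which the interpolation target is a randomly chosen one of the k nearest minority neighbors; this is an assumption, as are the function name `smote` and its parameters `n_new`, `k` and `seed`, none of which come from the embodiments.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats), at least two of them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by Euclidean distance (x itself excluded)
        others = [s for s in minority if s is not x]
        neighbours = sorted(others, key=lambda s: math.dist(x, s))[:k]
        x_nb = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # x_new = x + gap * (x_nb - x), computed per coordinate
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_nb)])
    return synthetic
```

Each synthetic vector lies on the line segment between a minority sample and one of its neighbors, so it stays inside the region already occupied by the minority class. In practice `n_new` would be chosen from the imbalance rate so that the two classes end up roughly equal in size.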

Specifically, undersampling balances the data set by reducing the number of majority class samples, the majority class samples being the samples of the normal subsets. The embodiments of the present application may undersample by randomly deleting majority class samples, by cluster centroids (ClusterCentroids), or by other means. Random deletion simply removes majority class samples at random until their number is close to that of the minority class samples. The cluster centroid approach uses a clustering algorithm to group the majority class samples into a smaller number of clusters and then selects one sample from each cluster as its representative, producing a new undersampled data set.
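The threshold check of step S601 and the random-deletion variant of undersampling can be sketched together as follows. This is a minimal illustration only: the function name `balance_by_undersampling`, the default threshold of 1.0, and the choice to shrink the normal set to exactly `threshold` times the attack set are all assumptions, since the embodiments only require that the two counts end up close.

```python
import random


def balance_by_undersampling(normal, attack, threshold=1.0, seed=0):
    """Randomly drop normal samples when len(normal) / len(attack) exceeds threshold."""
    if not attack:
        raise ValueError("attack subset must not be empty")
    ratio = len(normal) / len(attack)
    if ratio <= threshold:
        # Already balanced enough: return copies unchanged.
        return list(normal), list(attack)
    rng = random.Random(seed)
    keep = max(1, int(len(attack) * threshold))
    # random.sample draws without replacement, i.e. the rest is deleted.
    return rng.sample(normal, keep), list(attack)
```

With ten normal samples, three attack samples and the default threshold, seven normal samples are discarded at random and three are kept.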

It should be noted that, in the embodiments of the present application, the sample sequence feature vectors obtained after oversampling or undersampling may be input into the attack recognition model in sequence as the sample data. The two methods may also be combined, for example by oversampling the minority class samples and undersampling the majority class samples, to obtain a better balance; in that case the sample sequence feature vectors obtained after both oversampling and undersampling together serve as the sample data input into the attack recognition model.

Referring to FIG. 13, in some embodiments, step S402 may include steps S701 to S702:

Step S701: construct an initial causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges;

Step S702: clean and organize the nodes and edges in the initial causal graph to obtain the sample causal graph.

Exemplarily, when the audit logs are converted into a causal graph, the massive volume of audit logs yields a very large number of nodes, so the causal graph needs to be cleaned and organized before training. Specifically, an initial causal graph is first constructed with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges. This initial causal graph has not yet been cleaned and organized and contains useless nodes and edges. The nodes and edges in the initial causal graph are then cleaned and organized to obtain the cleaned sample causal graph.

It should be noted that cleaning and organizing the nodes and edges greatly reduces the complexity of the causal graph. The graph becomes more concise, unnecessary detail and noise are reduced, and the accuracy and consistency of the data are preserved, which makes the overall situation easier to analyze and understand. It also improves the learning effect of the model, allowing the model to focus on the nodes and edges related to security threats.

Exemplarily, the embodiments of the present application provide several means of cleaning and organizing. Referring to FIG. 14, in some embodiments, step S702 may include steps S801 to S803:

Step S801: determine, in the initial causal graph, the sample ATT&CK nodes marked as ATT&CK behavior and the sample normal nodes marked as normal behavior, and delete the sample normal nodes that cannot reach any sample ATT&CK node, together with the corresponding edges, to obtain the sample causal graph;

Step S802: determine the duplicate edges between any two nodes in the initial causal graph and delete the duplicate edges to obtain the sample causal graph;

Step S803: determine the different nodes in the initial causal graph that take part in the same type of event, merge them into a single node, and retain the shared input and output edges to obtain the sample causal graph.

Exemplarily, the embodiments of the present application may remove, for model learning, the nodes and edges from which no ATT&CK node can be reached. The labels are first used to determine the sample ATT&CK nodes marked as ATT&CK behavior and the sample normal nodes marked as normal behavior in the initial causal graph. The sample normal nodes that cannot reach any sample ATT&CK node, and their corresponding edges, are then located and deleted to obtain the sample causal graph.

Exemplarily, the embodiments of the present application may determine the duplicate edges between any two nodes in the initial causal graph. A duplicate edge arises mainly when one entity repeatedly performs the same operation on another entity, such as repeated reads or repeated writes. The duplicate edges are deleted to obtain the sample causal graph.

Exemplarily, the embodiments of the present application may determine the different nodes in the initial causal graph that take part in the same type of event, that is, nodes whose input node and output node are the same. Such nodes are merged into a single node, and the shared input and output edges are retained, to obtain the sample causal graph.

FIG. 3 shows a schematic diagram of an initial causal graph provided by an embodiment of the present application. The graph contains nine nodes, P1 through P9, of which P9 is a sample ATT&CK node. The edges are as follows, where T1 to T11 denote successive times: P1 reads P4 at T1; P1 reads P2 at T2; P2 writes P3 at T3; P4 binds P5 at T4; P4 binds P6 at T5; P4 binds P7 at T6; P5 sends to P8 at T7; P6 sends to P8 at T8; P7 sends to P8 at T9; P1 reads P4 at T10; and P8 executes P9 at T11.

Exemplarily, following the above steps, the nodes and edges from which the ATT&CK node cannot be reached are removed first: nodes P2 and P3 are deleted, along with the edges at T2 and T3. Next, all duplicate edges are deleted, and the earliest timestamp is kept. For example, node P1 reads node P4 at both T1 and T10, so the two edges are merged into a single read of P4 by P1 at T1. This case covers one entity repeatedly performing the same operation on another entity, such as repeated reads or repeated writes. Finally, different nodes that take part in the same type of event, such as nodes P5, P6 and P7, which share the same input node and output node, are merged into one node that retains the shared input and output. The sample causal graph obtained after cleaning and organizing is shown in FIG. 15, which is a schematic diagram of a cleaned and organized sample causal graph provided by an embodiment of the present application.
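The three cleaning operations can be sketched on the P1 to P9 example as follows. This is an illustrative reading under stated assumptions, not the implementation of the embodiments: the edge representation `(src, op, dst, t)`, the function name `clean_graph`, the use of the alphabetically smallest name as the merged node's name, and the rule that ATT&CK nodes are never merged are all hypothetical choices. Merging is done in a single pass over the original node signatures.

```python
from collections import defaultdict


def clean_graph(edges, attack_nodes):
    """edges: list of (src, op, dst, t). Returns the cleaned, sorted edge list."""
    attack = set(attack_nodes)

    # 1) Keep only nodes that can reach an ATT&CK node (reverse traversal).
    preds = defaultdict(set)
    for src, _, dst, _ in edges:
        preds[dst].add(src)
    keep, stack = set(attack), list(attack)
    while stack:
        for p in preds[stack.pop()]:
            if p not in keep:
                keep.add(p)
                stack.append(p)
    edges = [e for e in edges if e[0] in keep and e[2] in keep]

    # 2) Group non-ATT&CK nodes with identical inputs and outputs for merging.
    ins, outs = defaultdict(set), defaultdict(set)
    for src, op, dst, _ in edges:
        outs[src].add((op, dst))
        ins[dst].add((src, op))
    groups = defaultdict(list)
    for node in keep - attack:
        groups[(frozenset(ins[node]), frozenset(outs[node]))].append(node)
    rename = {n: n for n in keep}
    for members in groups.values():
        for n in members:
            rename[n] = min(members)  # representative name (assumption)

    # 3) Collapse duplicate edges, keeping the earliest timestamp.
    earliest = {}
    for src, op, dst, t in edges:
        key = (rename[src], op, rename[dst])
        earliest[key] = min(t, earliest.get(key, t))
    return sorted((s, o, d, t) for (s, o, d), t in earliest.items())
```

Applied to the eleven edges listed for FIG. 3, with timestamps 1 to 11 standing for T1 to T11, the sketch returns four edges: P1 reads P4 at 1, P4 binds P5 at 4, P5 sends to P8 at 7, and P8 executes P9 at 11, where P5 stands for the merged P5, P6 and P7.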

Further, the causal graph may also be cleaned and organized in the application phase. The set of entities is then extracted from the cleaned and organized causal graph, or the cleaned and organized causal graph is converted into the tracing graph. The embodiments of the present application impose no specific limitation in this regard.

Referring to FIG. 16, in some embodiments, step S104 may further include steps S901 to S903:

Step S901: obtain a preset name mapping table;

Step S902: map the entity names in each entity sequence to the name mapping table, re-determine the entity names in each entity sequence according to the mapping results, and determine the lemmatized entity sequences according to the re-determined entity names;

Step S903: perform feature conversion on each lemmatized entity sequence to obtain the corresponding sequence feature vectors.

Exemplarily, as shown in FIG. 12, the sequences must be lemmatized after they are extracted. An entity is generally a process or a file. In a computer, a process is in effect a file that is being executed, so the description of a process includes the path of that process. For example, a user opens software A by double-clicking its icon, but what the computer actually runs is C:\User\Desktop\A.exe: the code and data of this file are loaded into memory, and the file must be located through an absolute or relative path. Lemmatization therefore reduces the complexity of detection by mapping the many different names uniformly onto one name mapping table.

Exemplarily, the name mapping table is a table established in advance in the embodiments of the present application for lemmatization. It is divided into three major categories, namely processes, files and events, each of which is further divided into several subcategories. For example:

Processes include: system process (system_process), library process (lib_process), program process (programs_process) and user process (user_process);

Files include: system file (system_file), library file (lib_file), program file (program_file), user file (user_file) and combined files (combined_files);

Events (actions) include: read, write, delete, execute, fork, request, refer, bind, receive, send, connect, IP connect (ip_connect), session connect (session_connect) and resolve.

These types cover almost all audit log events and are sufficient to capture the context, the semantic and syntactic similarity, and the relationships with other words of the entities in the causal graph. Therefore, the entity names in each entity sequence are mapped to the name mapping table, the entity names in each entity sequence are re-determined according to the mapping results, and the lemmatized entity sequences are determined according to the re-determined entity names. Finally, feature conversion is performed on each lemmatized entity sequence to obtain the corresponding sequence feature vectors.

For example, the embodiments of the present application parse each sequence, find the entity name corresponding to each entity, and map it to the corresponding vocabulary entry. Suppose one of the entity sequences generated above is <C:\Windows\System32\cmd.exe, execute, C:\User\Desktop\malicious.exe>. It contains three entities: C:\Windows\System32\cmd.exe, execute, and C:\User\Desktop\malicious.exe. After conversion, the lemmatized entity sequence is <system_process, execute, user_file>. If later sequences map to the same result, the actions are considered very similar and the duplicate sequences can be discarded.
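The mapping in this example can be sketched as a small lookup. The prefix rules below are hypothetical and cover only the two paths used in the example; a real name mapping table would be far more complete. The sketch also assumes that the first element of a triple is a process and the third is a file, which matches the example but is not stated as a general rule in the embodiments. The function names `lemmatize` and `lemmatize_all` are likewise illustrative.

```python
# Hypothetical prefix rules (lower-cased path prefix -> category token).
PROCESS_RULES = [(r"c:\windows\system32", "system_process"), (r"c:\user", "user_process")]
FILE_RULES = [(r"c:\windows\system32", "system_file"), (r"c:\user", "user_file")]
EVENTS = {
    "read", "write", "delete", "execute", "fork", "request", "refer", "bind",
    "receive", "send", "connect", "ip_connect", "session_connect", "resolve",
}


def _map_name(name, rules):
    """Return the category of the first matching prefix, else the name unchanged."""
    lowered = name.lower()
    for prefix, category in rules:
        if lowered.startswith(prefix):
            return category
    return name


def lemmatize(sequence):
    """Map a (subject, event, object) triple onto name mapping table tokens."""
    subject, event, obj = sequence
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    return (_map_name(subject, PROCESS_RULES), event, _map_name(obj, FILE_RULES))


def lemmatize_all(sequences):
    """Lemmatize every sequence and drop duplicates, keeping first occurrences."""
    seen, result = set(), []
    for seq in sequences:
        lemma = lemmatize(seq)
        if lemma not in seen:
            seen.add(lemma)
            result.append(lemma)
    return result
```

Two sequences that differ only in the file name under the same directory collapse to one lemmatized sequence, which is the deduplication effect described above.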

Exemplarily, lemmatization and word embedding are also required in the application phase of the embodiments of the present application; no specific limitation is imposed here.

Referring to FIG. 17, in some embodiments, step S105 may further include steps S1001 to S1003:

Step S1001: assign a corresponding weight to each edge according to the causal relationship or path length between nodes in the causal graph;

Step S1002: select the ATT&CK node of any one ATT&CK entity as the initial node and the ATT&CK node of any other ATT&CK entity as the target node, obtain the shortest path from the initial node to the target node based on the weights using the Dijkstra algorithm, and take the shortest path as the target path;

Step S1003: retain, in the causal graph, the ATT&CK nodes and the nodes and edges on the target paths, delete the nodes and edges outside the target paths, and take the processed causal graph as the reconstructed attack tracing graph.

Exemplarily, the embodiments of the present application use the Dijkstra algorithm with custom weights to find the shortest path from each ATT&CK node to each ATT&CK node not yet visited, and take it as the final target path.

Specifically, each ATT&CK node is a node in the graph, and the connections between nodes represent different associations. The embodiments of the present application assign a corresponding weight to each edge according to the causal relationship or path length between nodes in the causal graph, thereby constructing a directed weighted graph. The ATT&CK node of any one ATT&CK entity is then selected as the initial node, the ATT&CK node of any other ATT&CK entity is selected as the target node, the shortest path from the initial node to the target node is obtained based on the weights using the Dijkstra algorithm, and that shortest path is taken as the target path. Finally, the ATT&CK entities are linked together to form the final output of the embodiments of the present application: the ATT&CK nodes and the nodes and edges on the target paths are retained in the causal graph, the nodes and edges outside the target paths are deleted, and the processed causal graph is taken as the reconstructed attack tracing graph.
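Steps S1001 to S1003 can be sketched with a standard heap-based Dijkstra search. The adjacency-list format, the function names `dijkstra_path` and `reconstruct`, and the simplification of searching every ordered pair of ATT&CK nodes are assumptions made for illustration; the embodiments do not fix how the weights are computed or in which order the pairs are visited.

```python
import heapq


def dijkstra_path(graph, source, target):
    """graph: {node: [(neighbour, weight), ...]} with non-negative weights.

    Returns the shortest path as a list of nodes, or None if target is unreachable.
    """
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def reconstruct(graph, attack_nodes):
    """Keep ATT&CK nodes plus the nodes and edges on shortest paths between them."""
    kept_nodes, kept_edges = set(attack_nodes), set()
    for s in attack_nodes:
        for t in attack_nodes:
            if s == t:
                continue
            path = dijkstra_path(graph, s, t)
            if path:
                kept_nodes.update(path)
                kept_edges.update(zip(path, path[1:]))
    return kept_nodes, kept_edges
```

Everything not returned by `reconstruct` is pruned. On a hypothetical eight-node graph in which nodes 1, 3, 5 and 8 are ATT&CK nodes, for instance, only those four nodes and the intermediate nodes on the shortest paths between them would remain in the attack tracing graph.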

Exemplarily, FIG. 18 shows a schematic diagram of the complete attack detection flow provided by an embodiment of the present application. In the application phase, the execution audit logs are processed to construct a causal graph, and sequences are extracted from it. After lemmatization and word embedding, the data is input into the trained attack recognition model for processing. In the figure, nodes 1, 3, 5 and 8 are thereby determined to be ATT&CK nodes. The ATT&CK nodes and the nodes and edges on the target paths are retained, the nodes and edges outside the target paths are deleted, and the processed causal graph is taken as the reconstructed attack tracing graph, which is the final pruning result.

FIG. 19 is an optional flowchart of the training method of the attack recognition model provided by an embodiment of the present application. The method in FIG. 19 may include, but is not limited to, steps S1101 to S1105.

Step S1101: obtain sample audit logs of ATT&CK tactics and techniques;

Step S1102: parse the sample audit logs, extract multiple sample entities and corresponding sample events from the sample audit logs, and construct a sample causal graph with the sample entities as nodes and the causal relationships between entities represented by the sample events as edges;

Step S1103: extract a set of sample entities from the sample causal graph, determine multiple sample ATT&CK entity subsets from the set of sample entities, and convert each sample ATT&CK entity subset into a corresponding sample entity sequence according to the causal relationships between the entities it contains, wherein each sample ATT&CK entity subset includes at least one sample ATT&CK entity;

Step S1104: perform feature conversion on each sample entity sequence to obtain the corresponding sample sequence feature vectors, and input the sample sequence feature vectors into the attack recognition model in sequence to obtain a sample prediction result for each sample entity sequence;

Step S1105: assign a sample label to each sample entity sequence according to the sample ATT&CK entities it contains, and adjust the parameters of the attack recognition model according to the sample labels and the sample prediction results to obtain the trained attack recognition model.

Referring to FIG. 20, an embodiment of the present application further provides a network attack reconstruction apparatus capable of implementing the above network attack reconstruction method. The network attack reconstruction apparatus includes:

a log acquisition module 2001, configured to obtain execution audit logs of ATT&CK tactics and techniques;

a causal graph construction module 2002, configured to parse the execution audit logs, extract multiple entities and corresponding events from the execution audit logs, and construct a causal graph with the entities as nodes and the causal relationships between entities represented by the events as edges;

a sequence extraction module 2003, configured to extract a set of entities, divide the set of entities into multiple entity subsets, and convert each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains;

a model prediction module 2004, configured to perform feature conversion on each entity sequence to obtain the corresponding sequence feature vectors, and input the sequence feature vectors into a pre-trained attack recognition model to obtain a prediction result for each entity sequence;

an attack reconstruction module 2005, configured to determine, according to the prediction results, the entity sequences marked as ATT&CK behavior as target sequences, determine the entities in the target sequences as ATT&CK entities, determine in the causal graph the target paths between the ATT&CK nodes where the ATT&CK entities are located, and convert the causal graph into a reconstructed attack tracing graph according to the target paths and the ATT&CK nodes.

Exemplarily, the network attack reconstruction apparatus can execute the above network attack reconstruction method and build the attack tracing graph from the execution audit logs of ATT&CK tactics and techniques. A causal graph is first constructed with the entities extracted from the execution audit logs as nodes and the events as edges. This causal graph contains a large number of normal nodes, and it is not yet known which nodes are ATT&CK nodes, so it cannot serve as the final tracing graph and requires further processing. The embodiments of the present application therefore extract the set of entities, divide it into multiple entity subsets, and convert each entity subset into a corresponding entity sequence according to the causal relationships between the entities it contains, so that each sequence carries a group of nodes for the model to process. Each entity sequence is then feature-converted and input into the pre-trained attack recognition model to obtain its prediction result. The model thus determines whether the entities in a sequence are ATT&CK entities; the nodes corresponding to the ATT&CK entities are the ATT&CK nodes, the target paths between the ATT&CK nodes can be determined in the causal graph, and the causal graph is finally converted into the reconstructed attack tracing graph according to the target paths and the ATT&CK nodes.
The embodiments of the present application need only the execution audit logs of ATT&CK tactics and techniques to build the attack tracing graph, with no additional workload. Because the entities are placed in entity sequences and the attack sequences are determined by the attack recognition model, the ATT&CK entities and nodes can be determined accurately and the attack tracing graph constructed from them. The embodiments of the present application can therefore improve the efficiency and quality of network attack reconstruction.

The specific implementation of the network attack reconstruction apparatus is basically the same as that of the network attack reconstruction method described above and is not repeated here. Provided that the requirements of the embodiments of the present application are met, the network attack reconstruction apparatus may also be provided with other functional modules to implement the network attack reconstruction method of the above embodiments.

An embodiment of the present application further provides an electronic device. The electronic device includes a memory and a processor; the memory stores a computer program, and the processor implements the above network attack reconstruction method when executing the computer program. The electronic device may be any smart terminal, including a tablet computer, an in-vehicle computer, and the like.

Referring to FIG. 21, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:

a processor 2101, which may be implemented as a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute the relevant programs to implement the technical solutions provided by the embodiments of the present application;

a memory 2102, which may be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 2102 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 2102 and is called by the processor 2101 to execute the network attack reconstruction method or the training method of the attack recognition model of the embodiments of the present application;

an input/output interface 2103, configured to input and output information;

a communication interface 2104, configured to enable communication between this device and other devices, either by wired means (such as USB or a network cable) or by wireless means (such as a mobile network, Wi-Fi or Bluetooth);

a bus 2105, which transfers information between the components of the device (such as the processor 2101, the memory 2102, the input/output interface 2103 and the communication interface 2104);

wherein the processor 2101, the memory 2102, the input/output interface 2103 and the communication interface 2104 are communicatively connected to one another within the device through the bus 2105.

An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above network attack reconstruction method or the training method of the attack recognition model.

As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. The memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

本申请实施例描述的实施例是为了更加清楚的说明本申请实施例的技术方案，并不构成对于本申请实施例提供的技术方案的限定，本领域技术人员可知，随着技术的演变和新应用场景的出现，本申请实施例提供的技术方案对于类似的技术问题，同样适用。The embodiments described in the embodiments of the present application are intended to more clearly illustrate the technical solutions of the embodiments of the present application and do not constitute a limitation on the technical solutions provided in the embodiments of the present application. Those skilled in the art will appreciate that with the evolution of technology and the emergence of new application scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

本领域技术人员可以理解的是，图中示出的技术方案并不构成对本申请实施例的限定，可以包括比图示更多或更少的步骤，或者组合某些步骤，或者不同的步骤。Those skilled in the art will appreciate that the technical solutions shown in the figures do not constitute a limitation on the embodiments of the present application, and may include more or fewer steps than shown in the figures, or a combination of certain steps, or different steps.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the devices and equipment disclosed above, may be implemented as software, firmware, hardware, or a suitable combination thereof.

The terms "first", "second", "third", "fourth", and so on (if any) in the specification of the present application and in the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "including" and "having", and any variations of them, are intended to cover a non-exclusive inclusion. For example, a process, method, device, product, or equipment comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or equipment.

It should be understood that, in the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean any of three cases: only A exists, only B exists, or both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
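The enumeration above can be checked mechanically. The following is a minimal sketch, not part of the application itself: the helper names `at_least_one_of` and `and_or` are hypothetical and chosen only for illustration. It lists every non-empty combination of the given items, which for a, b, and c yields the seven cases named in the text, and for "A and/or B" yields the three cases.

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of items, smallest combinations first.

    Hypothetical helper: models the phrase "at least one of the following".
    """
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


def and_or(a, b):
    """Return the cases covered by "A and/or B" (hypothetical helper)."""
    return at_least_one_of([a, b])


combos = at_least_one_of(["a", "b", "c"])
for combo in combos:
    # Join the members of each combination the way the text writes them.
    print(" and ".join(combo))
```

For n listed items this gives 2^n - 1 combinations, since every subset except the empty one satisfies "at least one".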

In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another device, and some features may be omitted or not executed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a plurality of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium that can store a program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The preferred embodiments of the present application have been described above with reference to the accompanying drawings; this description does not limit the scope of rights of the embodiments of the present application. Any modification, equivalent substitution, or improvement made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.