Technical Field
The present invention relates to the technical field of log anomaly detection, and in particular to a log anomaly detection method based on log parsing optimized with a bidirectional parallel tree.
Background Art
Online computer systems are exposed to a wide range of malicious attacks in cyberspace, and timely detection of abnormal events is a fundamental step in protecting them. System logs record detailed information about the computing events a system generates and play an important role in modern anomaly detection. As data that capture a system's real-time operating status and program execution, logs are both an important resource for software developers and operations personnel to monitor system health and one of the best data sources for system anomaly detection. Deep-learning-based log anomaly detection has become a key focus of automated anomaly detection research, but the following problems remain to be solved urgently: the information and features contained in raw log text are difficult for deep learning models to learn directly as input, so a dedicated log parsing method must be constructed; and traditional log anomaly detection models are tailored to specific systems and ignore the temporal and statistical features peculiar to logs, so the contextual information between preceding and following log events is hard to capture, which limits detection capability.
Summary of the Invention
To overcome the above deficiencies of the prior art, the present invention provides a log anomaly detection method based on bidirectional-parallel-tree-optimized log parsing, which improves the accuracy of log parsing and thereby the precision of log anomaly detection.

The technical solution adopted by the present invention is a log anomaly detection method based on bidirectional-parallel-tree-optimized log parsing, comprising the following steps:
Step S1: perform word segmentation and common-variable filtering on the log data to obtain all word combinations of each log.

Step S2: create initial groups from the longest word combination selected among the word combinations whose frequency exceeds a frequency threshold; for each initial group, create a bidirectional parallel tree rooted at the longest common pattern.

Step S3: update the nodes in the parent direction of the bidirectional parallel tree, checking whether constant words are missing from columns whose representative frequency is higher than that of the root node; all words in columns of the log messages whose representative frequency is higher than the root node's are added in the parent direction of the tree.

Step S4: update the nodes in the child direction of the bidirectional parallel tree, checking whether constant words are missing from columns whose representative frequency is lower than that of the root node; all words in such columns are added in the child direction of the tree.

Step S5: for each initial group, once the nodes in the parent and child directions have been added, the bidirectional parallel tree outputs a log template; each word in the initial group corresponds to a node in the tree, and each node is marked as either a constant or a variable.

Step S6: use the TF-IDF algorithm to convert the text of the log templates into feature vectors, then cluster the log templates with Bi-kmeans. In the parent direction, the templates are clustered on the TF-IDF feature vectors to identify patterns that appear frequently in the log data; in the child direction, clustering is performed on the structural features of the templates. Bi-kmeans clustering groups similar log templates into distinct categories.

Step S7: according to the log templates, select entries that represent log behavior; the entries arranged in chronological order constitute a log sequence. Convert each entry of the log sequence into a log key; the log keys arranged in their original order form a log key sequence. Assign each parsed log key a unique event id; the event ids form the vocabulary used to train the LogBERT model. A log sequence serves as an input sample, with a label indicating whether the sequence is normal or abnormal. Create a randomly initialized matrix E ∈ R^{K×d}, where each row is the embedding of one log key in the vocabulary, R denotes the real numbers, d is the embedding dimension, and K is the number of log keys extracted from the log messages. Also create a position embedding D ∈ R^{K×d}, generated with sinusoidal functions, to encode the position of each log key in the sequence. The LogBERT model represents the log key k_t at time step t by the input representation x_t, which is the sum of the log key embedding and the position embedding, t ∈ {1, 2, …, T}, where T is the number of time steps in the log sequence.

Step S8: feed the computed input representations {x_dist, x_1, x_2, …, x_t, …, x_{T−1}, x_T} into the Transformer encoder of the LogBERT model, where x_dist is the input representation of the distance token added at the beginning of every log sequence. The output of the Transformer encoder is the set of contextual embeddings {h_dist, h_1, h_2, …, h_t, …, h_{T−1}, h_T}, one for each log key in the sequence; h_dist is the contextual embedding of the distance token and h_t is the contextual embedding of the log key at time step t. The contextual embedding h_mask of each masked token is passed to a fully connected layer and then to a softmax function to obtain a probability distribution over the vocabulary; this distribution is used to predict the most appropriate log key to replace the masked token.

Step S9: use a spherical objective function to regularize the distribution of normal log sequences, so that normal sequences are concentrated and close to one another in the embedding space while abnormal sequences lie far from the center of the sphere.

Step S10: after training is completed, perform abnormal log sequence detection.
Furthermore, in step S1, the logs are split into words, the frequency of each word across all logs is collected, a tuple is formed for each word, and tuples with the same frequency within a log are combined into word combinations.

Furthermore, the common-variable filtering in step S1 proceeds as follows: first, a set of regular expression rules is defined; the log data are then read and each log is segmented, splitting the continuous text into separate words or phrases. The segmented logs are iterated over, and the defined regular expressions are used to match common variables; whenever a regular expression matches a common variable, the matched part is replaced with the wildcard "<*>". After the common variables have been filtered out, the frequency of each word across the entire log set is collected; a tuple containing the word and its frequency is created for each word, and tuples with the same frequency are merged into a word combination.

Furthermore, in step S2, logs of the same length are first grouped, and for each log the word combinations whose frequency exceeds the set frequency threshold are collected; next, each log selects its word combination containing the most words as its longest word combination; finally, logs sharing the same longest word combination are grouped together. Once the initial log groups have been created, the common longest word combination is the longest common pattern of each initial group.

Furthermore, in step S2, for each initial group a bidirectional parallel tree rooted at the longest common pattern is created; words in the same column lie at the same depth, and the maximum frequency with which any word appears in a column is taken as the representative frequency of that column. The two directions of the bidirectional parallel tree are called the parent direction and the child direction: the parent direction is used to check for constant words missing from columns whose representative frequency is higher than the root node's, and the child direction to check for constant words missing from columns whose representative frequency is lower than the root node's.

Furthermore, in step S3, all words in columns of the log messages whose representative frequency is higher than the root node's are added in the parent direction of the tree; the number of distinct words in each parent column is counted, and if a column contains multiple distinct words, its words are added to the tree as variable nodes; if several columns satisfy the parent-node condition, their nodes are added and classified in parallel.

Furthermore, in step S4, the columns are sorted by their number of distinct words, and a threshold on the number of distinct words per column is used for classification; after sorting, the child-direction nodes are added in order. If the number of distinct words in a column exceeds a preset threshold set from empirical knowledge, the words in that column are classified as variable words; otherwise they are classified as constants. Whenever a new constant word is added to the bidirectional parallel tree, all logs containing the words on the path from the root node to that leaf node form a new log group, and subsequently added child nodes are based on this new group; this hierarchical process continues until the grouping is exact and all constant words have been found.

Furthermore, in step S7, each log sequence is now a sequence of log keys {k_1, k_2, …, k_t, …, k_{T−1}, k_T}, where k_t is the log key at time step t, t ∈ {1, 2, …, T}; a unique distance token k_dist is added at the beginning of each log sequence.

Furthermore, in step S9, the spherical objective function is as follows:
L_VHM = (1/N) · Σ_{j=1}^{N} ‖h_dist^j − c‖²
where L_VHM is the spherical objective function loss value, h_dist^j is the contextual embedding vector of the distance token of the j-th log sequence, N is the number of such contextual embedding vectors, and c is the center of the normal log sequences in the training dataset, c = Mean(h_dist), where Mean denotes computing the mean.
Furthermore, in step S10, given a log sequence, a certain ratio of its log keys are first randomly replaced with mask tokens, and the randomly masked sequence is used as input to the LogBERT model. Then, for each mask token, a candidate set of g normal log keys is built; if the real log key is in this candidate set, it is regarded as normal. If an observed log key is not in the top-g candidate set predicted by the LogBERT model, it is considered abnormal; the top-g candidate set is the g elements with the highest probability in the predicted distribution. A log sequence is then marked as abnormal when it contains more than r abnormal log keys; g and r are both hyperparameters. When the LogBERT model predicts a masked log key, it generates a probability distribution.

Beneficial effects of the present invention: the parsing method based on the bidirectional parallel tree creates initial groups according to the longest common pattern and then uses the bidirectional parallel tree to hierarchically supplement constant words onto the longest common pattern, efficiently forming complete log templates. The method achieves high parsing accuracy on the HDFS, BGL, and Thunderbird datasets. LogBERT is a context-based model that generates token embeddings from context and thus better captures the information of the entire log sequence, improving the precision of anomaly detection.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of node updating in the parent direction.

FIG. 3 shows the construction process of the child-direction tree.
DETAILED DESCRIPTION
The present invention is described in further detail below with reference to the accompanying drawings.

Referring to FIG. 1, the log anomaly detection method based on bidirectional-parallel-tree-optimized log parsing comprises steps S1 to S10.
Step S1, log data preprocessing: perform word segmentation and common-variable filtering on the log data to obtain all word combinations of each log.

In step S1, each log is split into words, the frequency of each word across all logs is collected, a tuple is formed for each word, and tuples with the same frequency within a log are combined into word combinations. The segmentation pipeline is L -> W -> T -> F -> C, with L (log), W (word), T (tuple), F (frequency), and C (word combination).

Common-variable filtering in step S1: first, a set of regular expression rules is defined that can identify the common variables in the logs; for example, a common regular expression for matching IP addresses is \b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b. The log data are then read and each log is segmented, splitting the continuous text into separate words or phrases. The segmented logs are iterated over, and the defined regular expressions are used to match candidate common variables; whenever a regular expression matches a common variable, the matched part is replaced with the wildcard "<*>". This removes the concrete values of the variables, which are irrelevant to the analysis. After the common variables have been filtered out, the frequency of each word across the entire log set is collected, a tuple containing the word and its frequency is created for each word, and tuples with the same frequency are merged into a word combination. Finally, all resulting word combinations are output; they can be used for further data analysis or pattern recognition.
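As an illustration only, a minimal Python sketch of this preprocessing step might look as follows; the variable-matching rules and the sample logs are assumptions for the example, not part of the invention.

```python
import re
from collections import Counter

# Illustrative regular expressions for common variables (assumed rules).
VARIABLE_PATTERNS = [
    r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b",  # IP address
    r"\b0x[0-9a-fA-F]+\b",                 # hexadecimal identifier
    r"\b\d+\b",                            # bare number
]

def preprocess(logs):
    # Mask common variables with the wildcard "<*>" and split into words.
    tokenized = []
    for line in logs:
        for pattern in VARIABLE_PATTERNS:
            line = re.sub(pattern, "<*>", line)
        tokenized.append(line.split())

    # Collect each word's frequency over the whole log set.
    freq = Counter(w for words in tokenized for w in words if w != "<*>")

    # Per log, merge (word, frequency) tuples with equal frequency
    # into one word combination: {frequency: [words]}.
    combinations = []
    for words in tokenized:
        combo = {}
        for w in words:
            if w != "<*>":
                combo.setdefault(freq[w], []).append(w)
        combinations.append(combo)
    return tokenized, freq, combinations

logs = [
    "Accepted connection from 10.0.0.1 port 22",
    "Accepted connection from 10.0.0.2 port 80",
]
tokenized, freq, combos = preprocess(logs)
print(combos[0])  # {2: ['Accepted', 'connection', 'from', 'port']}
```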
Step S2, initial group creation: create initial groups from the longest word combination selected among the word combinations whose frequency exceeds the frequency threshold; for each initial group, create a bidirectional parallel tree rooted at the longest common pattern.

Because constant words have more consistent frequencies than variable words, the longest word combination in each log is the most likely to be part of the log template. However, when certain variables occur with low frequencies, those frequencies may happen to coincide without interfering with one another, so the variables share a single frequency and merge into the longest word combination. In that case a log contains only two word combinations (a variable combination and a constant combination), and the variable combination contains more words than the constant one. It is therefore necessary to find the logs with only two distinct frequencies and check whether their longest word combination is the variable combination; if so, a frequency threshold is applied to avoid selecting those variable words. Accordingly, step S2 of this embodiment creates the initial groups from the longest word combination selected among the word combinations whose frequency exceeds the frequency threshold. The frequency threshold is set by multiplying the highest word frequency in the log by a hyperparameter weight.

In step S2, logs of the same length are first grouped, and for each log the word combinations whose frequency exceeds the set frequency threshold are collected. Next, each log selects its word combination containing the most words as its longest word combination. Finally, logs sharing the same longest word combination are grouped together. Once the initial log groups have been created, the common longest word combination is the longest common pattern of each initial group.
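Continuing the sketch above, initial group creation might be expressed as follows; the hyperparameter weight of 0.5 is an assumed value, not one prescribed by the invention.

```python
from collections import defaultdict

def initial_groups(tokenized, combinations, weight=0.5):
    # Group logs of equal length by their longest word combination among the
    # combinations whose frequency exceeds the threshold (max frequency x weight).
    groups = defaultdict(list)
    for words, combo in zip(tokenized, combinations):
        threshold = weight * max(combo)                       # combo keys are frequencies
        candidates = [tuple(ws) for f, ws in combo.items() if f > threshold]
        longest = max(candidates, key=len)                    # the longest word combination
        groups[(len(words), longest)].append(words)
    return groups  # each value is one initial group; its key holds the common pattern
```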
In step S2, for each initial group a bidirectional parallel tree rooted at the longest common pattern is created. Words in the same column lie at the same depth because they share the same classification. The maximum frequency with which any word appears in a column is taken as the representative frequency of that column. The two directions of the bidirectional parallel tree are called the parent direction and the child direction: the parent direction is used to check for constant words missing from columns whose representative frequency is higher than the root node's, and the child direction to check for constant words missing from columns whose representative frequency is lower than the root node's.
Step S3, updating the parent-direction nodes: as shown in FIG. 2, where hollow circles denote constant nodes and solid circles denote variable nodes, the nodes in the parent direction of the bidirectional parallel tree are updated; the method checks whether constant words are missing from columns whose representative frequency is higher than the root node's, and all words in such columns of the log messages are added in the parent direction of the tree.

In step S3, all words in columns of the log messages whose representative frequency is higher than the root node's are added in the parent direction of the tree. In these columns, if variable words are present, distinct words are very likely to appear, because a variable position keeps producing different words. The number of distinct words in each parent column is therefore counted; if a column contains multiple distinct words, its words are added to the tree as variable nodes. If several columns satisfy the parent-node condition, their nodes are added and classified in parallel.
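A sketch of the parent-direction update, assuming a hypothetical tree object with an add_parent_node helper, might read:

```python
def update_parent_direction(tree, group, parent_columns):
    # group: list of tokenized logs; parent_columns: indices of columns whose
    # representative frequency is higher than the root node's. Columns are
    # independent, so they could equally be processed in parallel.
    for col in parent_columns:
        distinct = {log[col] for log in group}
        if len(distinct) > 1:
            # Multiple distinct words indicate a variable position.
            tree.add_parent_node(col, "<*>", kind="variable")
        else:
            tree.add_parent_node(col, distinct.pop(), kind="constant")
```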
Step S4, updating the child-direction nodes: the nodes in the child direction of the bidirectional parallel tree are updated; the method checks whether constant words are missing from columns whose representative frequency is lower than the root node's, and all words in such columns are added in the child direction of the tree.

In step S4, words in child columns that contain distinct words may still be constant words, so the parent-direction rule cannot be applied in this direction. Since a variable position produces more distinct words than a fixed one, the columns are sorted by their number of distinct words, and a threshold on the number of distinct words per column is used for classification. After sorting, the child-direction nodes are added in order. If the number of distinct words in a column exceeds a preset threshold set from empirical knowledge, the words in that column are classified as variable words; if it does not, they are classified as constants. Whenever a new constant word is added to the bidirectional parallel tree, all logs containing the words on the path from the root node to that leaf node form a new log group, and subsequently added child nodes are based on this new group. This hierarchical process continues until the grouping is exact and all constant words have been found.
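The child-direction update could be sketched as below, again with a hypothetical tree API; the branch threshold of 4 is an assumed empirical value. Constant words split the current group into more precise subgroups, as described above.

```python
def update_child_direction(tree, group, child_columns, branch_threshold=4):
    # Sort child columns by their number of distinct words (fewer first).
    columns = sorted(child_columns,
                     key=lambda c: len({log[c] for log in group}))
    subgroups = [group]
    for col in columns:
        refined = []
        for sub in subgroups:
            distinct = {log[col] for log in sub}
            if len(distinct) > branch_threshold:
                tree.add_child_node(col, "<*>", kind="variable")
                refined.append(sub)            # variables do not split the group
            else:
                for word in distinct:          # each constant spawns a new group
                    tree.add_child_node(col, word, kind="constant")
                    refined.append([log for log in sub if log[col] == word])
        subgroups = refined
    return subgroups
```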
FIG. 3 shows the construction process of the child-direction tree. In the logs of FIG. 3, two columns have a representative frequency lower than the root node's, namely the 4th and 5th columns. The words in the 5th column are added to the tree first because its representative frequency is higher. Two new constant nodes, "HTTPS" and "SOCKS5", are added under the root node because the number of branches is below the threshold. The initial group is then divided into two subgroups, subgroup 1 and subgroup 2, according to the left and right paths. The words in the 4th column of subgroup 1 are added as child nodes of the node "HTTPS", and the words in the 4th column of subgroup 2 as child nodes of the node "SOCKS5". Since the branch counts of these two nodes exceed the threshold, the new nodes are classified as variable nodes.
Step S5, template generation: for each initial group, once the nodes in the parent and child directions have been added, the bidirectional parallel tree outputs a log template; each word in the initial group corresponds to a node in the tree, and each node is marked as either a constant or a variable.

Step S6, Bi-kmeans clustering: use the TF-IDF algorithm to convert the text of the log templates into feature vectors, then cluster the log templates with Bi-kmeans. In the parent direction, the templates are clustered on the TF-IDF feature vectors to identify patterns that appear frequently in the log data; in the child direction, clustering is performed on the structural features of the templates. Bi-kmeans clustering groups similar log templates into distinct categories.

In step S6, the TF-IDF algorithm evaluates how important a word is to a log template by computing the word's frequency within the template (TF) and its inverse document frequency (IDF) over the whole log dataset. TF-IDF thus converts the log data into feature vectors that quantify the content of each template.
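As an illustrative sketch (assuming scikit-learn ≥ 1.1, whose BisectingKMeans implements bisecting k-means), the TF-IDF vectorization and clustering of templates could look like this; the toy templates are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import BisectingKMeans

templates = [                                   # toy log templates
    "Accepted connection from <*> port <*>",
    "Connection closed by <*>",
    "Failed password for <*> from <*>",
    "Accepted password for <*> from <*>",
]

# TF-IDF turns each template into a feature vector quantifying its content.
vectors = TfidfVectorizer(token_pattern=r"(?u)\S+").fit_transform(templates)

# Bisecting k-means repeatedly splits the largest cluster in two,
# grouping similar templates into categories.
labels = BisectingKMeans(n_clusters=2, random_state=0).fit_predict(vectors)
print(labels)  # one cluster label per template
```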
Step S7, feature extraction: according to the log templates, select entries that represent log behavior, such as event type, timestamp, and log level; a series of entries arranged in chronological order constitutes a log sequence. Each entry in the log sequence is converted into a log key, and the log keys arranged in their original order form the log key sequence; a log key is the string template extracted from a log message by the log parser.

Each parsed log key is assigned a unique event id, and these event ids form the vocabulary used to train the LogBERT model; when building this vocabulary, only events (log keys) whose number of occurrences exceeds a specified threshold are accepted. The log lines are then converted into log sequences.

Converting log lines into log sequences can be done in several ways:

A. Sliding window: use a sliding window of a given size to create log sequences, where all logs inside one time window (e.g., 5 minutes) are structured into one log sequence;

B. Fixed time window;

C. Methods based on log attributes;

D. Fixed number of log lines.
For example, applying a sliding window to the parsed log stream yields a set of log key sequences.
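A minimal sketch of option A, assuming the logs have already been parsed into (timestamp, log key) pairs and that the window and step sizes are illustrative choices, might be:

```python
from datetime import datetime, timedelta

def sliding_window(entries, window=timedelta(minutes=5), step=timedelta(minutes=1)):
    # entries: (timestamp, log_key) pairs sorted by timestamp; every window
    # of `window` length, advanced by `step`, yields one log key sequence.
    sequences = []
    if not entries:
        return sequences
    start, last = entries[0][0], entries[-1][0]
    while start <= last:
        seq = [key for ts, key in entries if start <= ts < start + window]
        if seq:
            sequences.append(seq)
        start += step
    return sequences

t0 = datetime(2024, 1, 1)
entries = [(t0 + timedelta(minutes=m), f"k{m % 3}") for m in range(10)]
print(sliding_window(entries)[:2])
```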
After this conversion, each log sequence is a sequence of log keys {k_1, k_2, …, k_t, …, k_{T−1}, k_T}, where k_t is the log key at time step t, t ∈ {1, 2, …, T}, and T is the number of time steps in the log sequence. A unique distance token k_dist is added at the beginning of each log sequence; it is used to compute the distance between this log sequence and the center, where the center is computed from all log sequences in the input.

A log sequence serves as an input sample, with a label indicating whether the sequence is normal (0) or abnormal (1). Next, a randomly initialized matrix E ∈ R^{K×d} is created to hold the log key embeddings (analogous to word embeddings): each row of the matrix is the embedding of one log key in the vocabulary, R denotes the real numbers, d is the embedding dimension, and K is the number of log keys extracted from the log messages. In addition, a position embedding D ∈ R^{K×d}, generated with sinusoidal functions, is created to encode the position of each log key within the sequence.

The LogBERT model represents the log key k_t at time step t by the input representation x_t, which is the sum of the log key embedding and the position embedding, t ∈ {1, 2, …, T}.
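A sketch of building the input representations, with assumed sizes (K log keys, dimension d, sequence length T) and a randomly drawn toy sequence, could be:

```python
import numpy as np

K, d, T = 50, 64, 10                       # assumed sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(K, d))                # random log key embedding matrix

def sinusoidal_positions(T, d):
    # Sinusoidal position embedding encoding each position in the sequence.
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

keys = rng.integers(0, K, size=T)          # a toy log key sequence
X = E[keys] + sinusoidal_positions(T, d)   # x_t = key embedding + position embedding
print(X.shape)                             # (10, 64)
```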
Step S8, masked log key prediction: the computed input representations {x_dist, x_1, x_2, …, x_t, …, x_{T−1}, x_T} are fed into the Transformer encoder of the LogBERT model; x_dist is the input representation of the distance token added at the beginning of every log sequence. The input representations pass through the self-attention mechanism of stacked Transformer layers, as in the BERT model; when the input representations are fed to the Transformer encoder, a certain proportion of the tokens are randomly masked as part of the training objective. The output of the Transformer encoder is the set of contextual embeddings {h_dist, h_1, h_2, …, h_t, …, h_{T−1}, h_T}, one for each log key in the sequence; h_dist is the contextual embedding of the distance token, and h_t is the contextual embedding of the log key at time step t.

The contextual embedding h_mask of each masked token is passed to a fully connected layer and then to a softmax function to obtain a probability distribution over the vocabulary; this distribution is used to predict the most appropriate log key to replace the masked token. The objective is to learn to predict the masked tokens in the log sequence: the higher the probability assigned to the actual log key in the predicted distribution y_mask^j, the lower the loss value L_MLKP.
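The prediction head for a masked position could be sketched as follows; the fully connected layer and the toy embedding here are assumptions (a real model would learn these weights during training):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 50, 64                               # vocabulary size, embedding dim (assumed)
W, b = rng.normal(size=(d, K)), np.zeros(K) # fully connected layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlkp_loss(h_mask, true_key):
    # y_mask: probability distribution over the vocabulary for one masked
    # position; the loss falls as the true log key's probability rises.
    y_mask = softmax(h_mask @ W + b)
    return -np.log(y_mask[true_key])

h_mask = rng.normal(size=d)                 # toy contextual embedding of a masked token
print(mlkp_loss(h_mask, true_key=7))
```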
Step S9, hypersphere volume minimization: a spherical objective function regularizes the distribution of normal log sequences, so that normal sequences are concentrated and close to one another in the embedding space while abnormal sequences lie far from the center of the sphere. In the training phase, the LogBERT model learns and adjusts its weights to minimize this loss. Trained on normal log sequences through the two tasks above, the model learns the patterns of normal sequences and can therefore achieve high accuracy when predicting masked tokens in normal log sequences. The spherical objective function is as follows:
L_VHM = (1/N) · Σ_{j=1}^{N} ‖h_dist^j − c‖²
where L_VHM is the spherical objective function loss value, h_dist^j is the contextual embedding vector of the distance token of the j-th log sequence, N is the number of such contextual embedding vectors, and c is the center of the normal log sequences in the training dataset, c = Mean(h_dist), where Mean denotes computing the mean.
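A direct sketch of this loss in code, using toy embeddings in place of real distance-token outputs, is:

```python
import numpy as np

def vhm_loss(h_dist):
    # h_dist: (N, d) array of distance-token embeddings of N log sequences;
    # c is their mean, i.e. the center of the hypersphere.
    c = h_dist.mean(axis=0)
    return np.mean(np.sum((h_dist - c) ** 2, axis=1))

h = np.random.default_rng(0).normal(size=(8, 64))  # toy embeddings
print(vhm_loss(h))
```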
Step S10: after the LogBERT model has been trained, abnormal log sequence detection is performed. Given a test log sequence, a certain ratio of its log keys are first randomly replaced with mask tokens (MASK), and the randomly masked sequence is used as input to the LogBERT model. Then, for each mask token, a candidate set of g normal log keys is built; if the real log key is in this candidate set, it is regarded as normal. If an observed log key is not in the top-g candidate set predicted by the LogBERT model, it is considered abnormal; the top-g candidate set is the g elements with the highest probability in the predicted distribution. A log sequence is then marked as abnormal when it contains more than r abnormal log keys. Both g and r are hyperparameters that can be tuned on a validation set; when the LogBERT model predicts a masked log key, it generates a probability distribution indicating how likely each possible key is to appear in that position.
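The detection rule can be sketched as follows; the probability distributions are simulated here, whereas in the method they come from the trained LogBERT model, and the g and r values are illustrative.

```python
import numpy as np

def is_anomalous(pred_dists, true_keys, g=10, r=2):
    # pred_dists: (M, K) array with one vocabulary distribution per masked
    # position; true_keys: the M observed log keys. A key is abnormal when it
    # falls outside the top-g candidates; the sequence is abnormal when more
    # than r keys are abnormal. g and r are hyperparameters.
    abnormal = 0
    for dist, key in zip(pred_dists, true_keys):
        top_g = np.argsort(dist)[-g:]        # indices of the g most probable keys
        if key not in top_g:
            abnormal += 1
    return abnormal > r

rng = np.random.default_rng(0)
dists = rng.dirichlet(np.ones(50), size=5)   # toy distributions over 50 keys
print(is_anomalous(dists, true_keys=[3, 7, 1, 42, 9]))
```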
The specific embodiments described above further illustrate the purpose, technical solution, and technical effects of the present invention in detail. It should be understood that the above description covers only specific embodiments of the present invention and is not intended to limit its scope; any equivalent changes and modifications made by those skilled in the art without departing from the ideas and principles of the present invention shall fall within the scope of protection of the present invention.