CN116204801A - A method and device for log aggregation - Google Patents

A method and device for log aggregation
Info

Publication number
CN116204801A
Authority
CN
China
Prior art keywords
log
phrase
similarity
matching
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310093874.5A
Other languages
Chinese (zh)
Inventor
熊豹
蒋烁淼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Observation Future Information Technology Co ltd
Original Assignee
Shanghai Observation Future Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Observation Future Information Technology Co ltd
Priority to CN202310093874.5A
Publication of CN116204801A
Pending

Abstract

The invention provides a method and a device for log aggregation. The method comprises the following steps: S1, segmenting a log text into words to obtain segmented log phrases; S2, building a fixed-depth prefix tree structure and matching the log phrases one by one to leaf nodes of the prefix tree structure to obtain matching information; and S3, according to the matching information, comparing the log phrases in turn with the category phrase information of the leaf nodes, calculating a similarity, and deciding whether to aggregate the log according to whether the similarity reaches a threshold. By segmenting and matching logs in this way, the method can quickly aggregate similar log texts.

Description

Translated from Chinese
A method and device for log aggregation

Technical field

The present invention belongs to the field of software technology and specifically relates to a method and device for log aggregation.

Background

In real-world scenarios, the runtime logs emitted by business programs are often very large, and their types are varied and complex. When these logs are queried and viewed, a log list sorted by time makes it hard for users to focus quickly on the important entries. We propose an algorithm that aggregates similar log texts into a single row, which improves the efficiency of reviewing all logs and helps locate faults quickly.

The algorithm most commonly used in the industry today is Drain, but it runs inefficiently and aggregates poorly for logs of dynamic length and for logs that contain JSON text.

In view of this, the present invention is proposed.

Summary of the invention

In view of this, the present invention discloses a method and device for log aggregation that can quickly aggregate similar log texts by segmenting logs into words and matching them.

Specifically, the present invention is implemented through the following technical solutions:

In a first aspect, the present invention discloses a method for log aggregation, comprising the following steps:

S1. Segment the log text into words to obtain the segmented log phrases;

S2. Build a fixed-depth prefix tree structure and match the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;

S3. According to the matching information, compare the log phrases in turn with the category phrase information of the leaf node and calculate a similarity, then decide whether to aggregate the log according to whether the similarity reaches a threshold.

Here, the leaf nodes store the final log category information.
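The structure used in S2 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a Drain-style layout in which the first layer is keyed by the number of words and the following layers by the leading words. The text only states that the first layer is looked up by word count, so the keys of the deeper layers, the `depth` parameter, the `<EMPTY>` padding token, and the names `FixedDepthTree`, `Node`, `Category`, and `leaf_for` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """A log category kept in a leaf: its template words and a hit counter."""
    template: list[str]
    count: int = 1


@dataclass
class Node:
    children: dict = field(default_factory=dict)    # key -> Node
    categories: list = field(default_factory=list)  # filled only at leaves


class FixedDepthTree:
    """Prefix tree in which every path from the root has the same length."""

    def __init__(self, depth: int = 2):
        # depth = number of leading words used as keys below the length layer
        # (an assumed parameter, not a value taken from the patent text)
        self.depth = depth
        self.root = Node()

    def _path(self, tokens: list[str]) -> list:
        # First layer: word count. Following layers: leading words, padded
        # so that short logs still end at a leaf of the fixed depth.
        heads = tokens[: self.depth]
        heads = heads + ["<EMPTY>"] * (self.depth - len(heads))
        return [len(tokens), *heads]

    def leaf_for(self, tokens: list[str], create: bool = False):
        """Walk the tree; return the leaf node, or None if the path is absent."""
        node = self.root
        for key in self._path(tokens):
            if key not in node.children:
                if not create:
                    return None
                node.children[key] = Node()
            node = node.children[key]
        return node


tree = FixedDepthTree(depth=2)
tokens = "Just A Log Template <Number>".split()
tree.leaf_for(tokens, create=True).categories.append(Category(tokens))

hit = tree.leaf_for("Just A Log Template 123".split())   # same length and prefix
miss = tree.leaf_for("Connection closed".split())        # no such path yet
```

Because the depth is fixed, a lookup touches a bounded number of nodes regardless of how long the log line is; the per-category comparison of S3 then runs only over the categories stored in the leaf that was reached.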

Further, in step S3, the similarity is calculated as the number of identical phrases divided by the total number of phrases.
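As a worked illustration of this ratio, the sketch below compares two word lists position by position. The position-wise comparison and the use of the longer list as the denominator are assumptions made for the example; the function name `token_similarity` is hypothetical.

```python
def token_similarity(log_tokens, template_tokens):
    """Share of positions whose words are identical (0.0 for empty input)."""
    total = max(len(log_tokens), len(template_tokens))
    if total == 0:
        return 0.0
    same = sum(1 for a, b in zip(log_tokens, template_tokens) if a == b)
    return same / total


# Four of the five positions agree, so the similarity is 4 / 5.
sim = token_similarity(
    "user <ID> login from <Number>".split(),
    "user <ID> logout from <Number>".split(),
)
```

With a threshold of, say, 0.6 (an assumed value), these two lines would fall into the same category.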

Further, in step S2, the matching information includes match-success information and match-failure information;

The match-success information covers:

the segmented log can be matched to the leaf node;

the segmented log cannot be matched to the leaf node, but can be matched to a category of an adjacent phrase length;

The match-failure information covers:

the segmented log can be matched neither to the leaf node nor to a category of an adjacent phrase length.

Further, when the segmented log can be matched to a category of an adjacent phrase length, the similarity calculation further includes: computing a feature vector for each log group in advance with the minhash algorithm, calculating the Jaccard distance of the log line, counting the number of identical words in the two logs from left to right, and dividing that count by the log length to obtain the similarity.
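The two ingredients named here can be sketched separately. The text does not say how the MinHash-based Jaccard estimate and the left-to-right count are combined, so the sketch below only shows each one on its own. The hash construction (salted BLAKE2b), the signature size of 64, and the function names are assumptions for illustration, and the left-to-right count is interpreted as a position-wise comparison.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature of a word set: per seed, the smallest hash value."""
    unique = set(tokens)
    if not unique:
        raise ValueError("cannot build a signature for an empty log")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # one salted hash function per seed
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for tok in unique
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of equal signature slots, an estimate of |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def left_to_right_similarity(log_tokens, template_tokens):
    """Identical words at the same position, scanned from the left and
    divided by the length of the incoming log."""
    same = sum(1 for a, b in zip(log_tokens, template_tokens) if a == b)
    return same / len(log_tokens)


template = "send <Number> bytes to <ID>".split()   # 5 words, signature precomputed
log = "send <Number> bytes to <ID> ok".split()     # 6 words, adjacent length

sig_template = minhash_signature(template)
jaccard_distance = 1.0 - estimated_jaccard(sig_template, minhash_signature(log))
similarity = left_to_right_similarity(log, template)   # 5 / 6
```

Precomputing the signature per log group means that only the incoming line has to be hashed at match time, which is presumably why the feature vectors are computed in advance.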

Further, the log aggregation method includes:

if the matching information is match-success information and the similarity reaches the matching threshold, the log is judged to belong to the same category;

if the matching information is match-failure information, a new log category is created and placeholder processing is applied to the text data of the new log category.

In a second aspect, the present invention discloses a device for log aggregation, comprising:

Log segmentation module: segments the log text into words to obtain the segmented log phrases;

Matching information acquisition module: builds a fixed-depth prefix tree structure and matches the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;

Log aggregation module: according to the matching information, compares the log phrases in turn with the category phrase information of the leaf node and calculates a similarity, then decides whether to aggregate the log according to whether the similarity reaches a threshold.

In a third aspect, the present invention discloses a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the log aggregation method described in the first aspect are implemented.

In a fourth aspect, the present invention discloses a computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of the log aggregation method described in the first aspect are implemented.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The log aggregation method and device proposed by the present invention aggregate fixed-length, variable-length, and JSON-formatted logs better while maintaining very high matching efficiency. Aggregating logs in this way quickly groups similar log texts and significantly improves the efficiency of data preprocessing; segmentation and clustering work better on logs with complex formats, and clustering works better on variable-length logs.

Description of the drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

FIG. 1 is a flowchart of the log aggregation method provided by an embodiment of the present invention;

FIG. 2 is a detailed schematic diagram of the log aggregation method provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of the log aggregation device provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present invention.

Detailed description

The technical solutions of the present invention are described clearly and completely below with reference to the drawings and specific embodiments. Those skilled in the art will understand that the embodiments described below are some, not all, of the embodiments of the present invention; they serve only to illustrate the invention and should not be regarded as limiting its scope. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

To explain the technical solution of the present invention more clearly, it is described below in the form of a specific embodiment.

Embodiment

Referring to FIG. 1, the present invention discloses a method for log aggregation, comprising the following steps:

S1. Segment the log text into words to obtain the segmented log phrases;

S2. Build a fixed-depth prefix tree structure and match the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;

S3. According to the matching information, compare the log phrases in turn with the category phrase information of the leaf node and calculate a similarity, then decide whether to aggregate the log according to whether the similarity reaches a threshold.

Referring to FIG. 2, in actual operation the invention follows the process below:

First, the log is preprocessed: common log patterns such as times, user IDs, and IP addresses are replaced with placeholders. Replacing common variables in the log with identical placeholders raises the proportion of identical words and thus the similarity of the final text. In initial use, some commonly used log-pattern placeholders are added for the logs, for example:

1. Time: [0-9]{1,}:[0-9]{1,}:[0-9]{1,}.?[0-9]{1,}?

2. IPv4: \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}

3. Hex: 0x[a-f0-9A-F]+

4. Number: -?[0-9]{1,}.?[0-9]*

5. ID: [0-9a-f]{4,}

Note that times can appear in logs in a great many formats; only the simplest one is added here.
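A preprocessing pass of this kind can be sketched as below. The patterns are adapted from the list above rather than copied: the decimal point is escaped and boundary checks are added so that digits inside a longer word are left alone. Those adjustments, the placeholder names `<Time>` and `<IPv4>`, the order of application, and the function name `preprocess` are assumptions made for the example.

```python
import re

# (placeholder, pattern) pairs, applied in order. Time and IPv4 come first so
# that their digits are not consumed by the more general Number pattern.
PLACEHOLDERS = [
    ("<Time>",   re.compile(r"\b[0-9]+:[0-9]+:[0-9]+(?:\.[0-9]+)?\b")),
    ("<IPv4>",   re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")),
    ("<Hex>",    re.compile(r"\b0x[a-fA-F0-9]+\b")),
    # A number must not touch a word character or a dot on either side.
    ("<Number>", re.compile(r"(?<![\w.])-?[0-9]+(?:\.[0-9]+)?(?![\w.])")),
    ("<ID>",     re.compile(r"\b[0-9a-f]{4,}\b")),
]


def preprocess(line: str) -> str:
    """Replace common variable fields in a log line with fixed placeholders."""
    for placeholder, pattern in PLACEHOLDERS:
        line = pattern.sub(placeholder, line)
    return line


raw = "12:30:45.123 user 3fa9c2d1 from 10.0.0.7 sent 0x1F 42 bytes"
clean = preprocess(raw)
# clean == "<Time> user <ID> from <IPv4> sent <Hex> <Number> bytes"
```

Two lines that differ only in their timestamp, user ID, address, or byte count become identical after this pass, which is what raises the share of identical words.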

To avoid degrading matching efficiency, regular-expression matching is applied only to the words that need it while the log similarity is being calculated. Specifically:

Suppose the template of the current log group is Just A Log Template <Number>, and the log Just A Log Template 123 has to be compared with this template. When comparing word by word, we only need to check whether the last word of the log matches the regular expression of the <Number> placeholder. There is no need to test whether that word matches any other placeholder, to look at the other words, or to fully preprocess the log in advance.
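The lazy check described in this example can be sketched as follows. The three patterns are the basic placeholder expressions given in the description; the dictionary `PLACEHOLDER_PATTERNS` and the function name `count_matches` are hypothetical names for the illustration.

```python
import re

PLACEHOLDER_PATTERNS = {
    "<Number>": re.compile(r"[-+]?[0-9]+"),
    "<Hex>": re.compile(r"0x[a-fA-F0-9]+"),
    "<ID>": re.compile(r"[a-fA-F0-9]{4,}"),
}


def count_matches(log_tokens, template_tokens):
    """Count positions where the log word equals the template word or
    satisfies the placeholder that the template has at that position."""
    same = 0
    for token, slot in zip(log_tokens, template_tokens):
        if token == slot:
            same += 1
            continue
        # Only the regex of the placeholder found in this slot is tried;
        # literal slots never trigger a regex evaluation.
        pattern = PLACEHOLDER_PATTERNS.get(slot)
        if pattern is not None and pattern.fullmatch(token):
            same += 1
    return same


template = "Just A Log Template <Number>".split()
matched = count_matches("Just A Log Template 123".split(), template)      # 5
unmatched = count_matches("Just A Log Template abc".split(), template)    # 4
```

For the five-word log above, exactly one regular expression is evaluated, against the word 123; the four literal words are settled by plain string comparison.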

The preprocessed data is segmented into words and a fixed-depth prefix tree structure is built; the leaf node for the corresponding number of words is found at the first layer of the prefix tree structure. Specifically, spaces are added before and after common symbols so that these symbols are split into separate words during segmentation. As a result, an IPv4 address is split into <Number>:<Number>:<Number>:<Number>, an IPv6 address is split into <ID> and : sub-elements, and a time format is split into several <Number> and : elements. With this logic, only three basic placeholder elements need to be maintained:

1. Number: [-+]?[0-9]+

2. Hex: 0x[a-f0-9A-F]+

3. ID: [a-f0-9A-F]{4,}

Besides being simpler to maintain, this removes the need to maintain common data formats, and it is friendlier to JSON text, whose key-value pairs are split more cleanly;
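The segmentation step can be sketched like this. The description does not list which symbols count as common symbols, so the set in `SYMBOL_CHARS` is an assumption, as are the names `tokenize` and `BASIC`; the three patterns are the basic placeholders listed above.

```python
import re

# Symbols that become words of their own (an assumed set, not from the patent).
SYMBOL_CHARS = ":=,;{}[]()\"'"
SYMBOLS = re.compile("([" + re.escape(SYMBOL_CHARS) + "])")

# The three basic placeholders, tried in this order on each whole word.
BASIC = [
    ("<Number>", re.compile(r"[-+]?[0-9]+")),
    ("<Hex>", re.compile(r"0x[a-fA-F0-9]+")),
    ("<ID>", re.compile(r"[a-fA-F0-9]{4,}")),
]


def tokenize(line: str) -> list:
    """Pad symbols with spaces, split on whitespace, then map each word
    that fully matches a basic pattern to its placeholder."""
    spaced = SYMBOLS.sub(r" \1 ", line)
    tokens = []
    for word in spaced.split():
        for placeholder, pattern in BASIC:
            if pattern.fullmatch(word):
                word = placeholder
                break
        tokens.append(word)
    return tokens


plain = tokenize("12:30:45 peer=fe80::1a2b code=500")
json_like = tokenize('{"code":500}')
```

Here the time splits into three `<Number>` words separated by `:` words, the IPv6 fragment into `<ID>` and `:` words, and the JSON key and value end up as separate words, so no dedicated time, IPv6, or JSON pattern is needed.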

For cases where clustering at a fixed length fails, the aggregation algorithm adds similarity-judgment logic for cluster groups of adjacent length, so that two logs of different lengths but high similarity are aggregated into the same category for subsequent data processing;

The cluster buckets pointed to by the prefix search tree are traversed; according to the matching information, the log phrases are compared in turn with the category phrase information of the leaf node and a similarity is calculated, and logs are aggregated according to whether the similarity reaches the threshold. Specifically: if the segmented log can be matched to a leaf node, the similarity is the number of identical phrases divided by the total number of phrases, and the log is considered to belong to the same category when the similarity reaches the configured threshold. If the segmented log cannot be matched to a leaf node but can be matched to a category of an adjacent phrase length, a feature vector is computed for each log group in advance with the minhash algorithm, the Jaccard distance of the log line is calculated, the number of identical words in the two logs is counted from left to right and divided by the log length to obtain the similarity, and the log is considered to belong to the same category when the similarity reaches the configured threshold. If the segmented log can be matched neither to a leaf node nor to a category of an adjacent phrase length, a new log category is created and placeholder processing is applied to the text data of the new log category.
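The three outcomes described here (same-length match, adjacent-length match, new category) can be sketched end to end. This is a simplified illustration under stated assumptions: a dictionary keyed by word count stands in for the prefix tree, the MinHash step is left out, the same position-wise score divided by the log length is used for both the same-length and the adjacent-length case, and the threshold value 0.6 and the names `Aggregator` and `Category` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

THRESHOLD = 0.6  # assumed value; the text only says the threshold is configured


@dataclass
class Category:
    template: list
    count: int = 1


def positional_similarity(tokens, template):
    """Identical words at the same position divided by the log length."""
    if not tokens:
        return 0.0
    same = sum(1 for a, b in zip(tokens, template) if a == b)
    return same / len(tokens)


class Aggregator:
    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.buckets = defaultdict(list)  # word count -> list of Category

    def _best(self, tokens, length):
        best, best_sim = None, 0.0
        for category in self.buckets.get(length, []):
            sim = positional_similarity(tokens, category.template)
            if sim > best_sim:
                best, best_sim = category, sim
        return best, best_sim

    def add(self, tokens):
        # 1) bucket of the same length, 2) buckets of adjacent length
        for length in (len(tokens), len(tokens) - 1, len(tokens) + 1):
            category, sim = self._best(tokens, length)
            if category is not None and sim >= self.threshold:
                category.count += 1
                return category
        # 3) nothing similar enough: open a new category
        category = Category(list(tokens))
        self.buckets[len(tokens)].append(category)
        return category


agg = Aggregator()
first = agg.add("user <ID> login from <Number>".split())        # new category
second = agg.add("user <ID> login from <Number>".split())       # same length
third = agg.add("user <ID> login from <Number> ok".split())     # adjacent length
other = agg.add("disk full".split())                            # new category
```

In this run the six-word line joins the five-word category through the adjacent-length path with a score of 5/6, while the unrelated line opens a second category.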

The present invention also provides a device for log aggregation, as shown in FIG. 3, which specifically includes:

Log segmentation module: segments the log text into words to obtain the split log phrases;

Matching information acquisition module: builds a fixed-depth prefix tree structure and matches the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;

Log aggregation module: according to the matching information, compares the log phrases in turn with the category phrase information of the leaf node and calculates a similarity, then decides whether to aggregate the log according to whether the similarity reaches a threshold.

The device consists mainly of the three modules above; building the system in this way achieves the goal of mounting the same file system simultaneously so that operations can run in parallel.

In a specific implementation, each of the above modules may be implemented as an independent entity, or the modules may be combined arbitrarily and implemented as one or several entities. For the specific implementation of each unit, refer to the preceding method embodiment; details are not repeated here.

FIG. 4 is a schematic structural diagram of a computer device disclosed in the present invention. Referring to FIG. 4, the computer device 400 includes at least a memory 402 and a processor 401. The memory 402 is connected to the processor through a communication bus 403 and stores computer instructions executable by the processor 401; the processor 401 reads the computer instructions from the memory 402 to implement the steps of the log aggregation method described in the above embodiment.

Since the device embodiment basically corresponds to the method embodiment, refer to the description of the method embodiment for the relevant parts. The device embodiment described above is only illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the disclosed solution. Those of ordinary skill in the art can understand and implement it without creative effort.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Finally, it should be noted that although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what is claimed; they mainly describe features of specific embodiments of particular inventions. Certain features described in multiple embodiments in this specification can also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment can also be implemented separately in multiple embodiments or in any suitable subcombination. Furthermore, although features may act in certain combinations as described above and may even be initially claimed as such, one or more features of a claimed combination can in some cases be removed from that combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequence, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of the various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.

The above are only preferred embodiments of the present disclosure and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (8)

Translated from Chinese
1. A method for log aggregation, characterized by comprising the following steps:
S1. Segment the log text into words to obtain the segmented log phrases;
S2. Build a fixed-depth prefix tree structure and match the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;
S3. According to the matching information, compare the log phrases in turn with the category phrase information of the leaf node and calculate a similarity, then decide whether to aggregate the log according to whether the similarity reaches a threshold.
2. The method for log aggregation according to claim 1, characterized in that, in step S3, the similarity is calculated as the number of identical phrases divided by the total number of phrases.
3. The method for log aggregation according to claim 1, characterized in that, in step S2, the matching information includes match-success information and match-failure information;
the match-success information covers:
the segmented log can be matched to the leaf node;
the segmented log cannot be matched to the leaf node, but can be matched to a category of an adjacent phrase length;
the match-failure information covers:
the segmented log can be matched neither to the leaf node nor to a category of an adjacent phrase length.
4. The method for log aggregation according to claim 2, characterized in that, when the segmented log can be matched to a category of an adjacent phrase length, the similarity calculation further includes: computing a feature vector for each log group in advance with the minhash algorithm, calculating the Jaccard distance of the log line, counting the number of identical words in the two logs from left to right, and dividing that count by the log length to obtain the similarity.
5. The method for log aggregation according to claim 2, characterized in that the log aggregation method includes:
if the matching information is match-success information and the similarity reaches the matching threshold, the log is judged to belong to the same category;
if the matching information is match-failure information, a new log category is created and placeholder processing is applied to the text data of the new log category.
6. A device for log aggregation, using the method according to any one of claims 1-5, characterized by comprising:
Log segmentation module: segments the log text into words to obtain the segmented log phrases;
Matching information acquisition module: builds a fixed-depth prefix tree structure and matches the log phrases one by one to a leaf node of the prefix tree structure to obtain matching information;
Log aggregation module: according to the matching information, compares the log phrases in turn with the category phrase information of the leaf node and calculates a similarity, then decides whether to aggregate the log according to whether the similarity reaches a threshold.
7. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed, the steps of the log aggregation method according to any one of claims 1-5 are implemented.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that, when the processor executes the program, the steps of the log aggregation method according to any one of claims 1-6 are implemented.
CN202310093874.5A | 2023-02-10 | 2023-02-10 | A method and device for log aggregation | Pending | CN116204801A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310093874.5A, CN116204801A (en) | 2023-02-10 | 2023-02-10 | A method and device for log aggregation

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310093874.5A, CN116204801A (en) | 2023-02-10 | 2023-02-10 | A method and device for log aggregation

Publications (1)

Publication Number | Publication Date
CN116204801A (en) | 2023-06-02

Family

ID=86516750

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310093874.5A (Pending, CN116204801A (en)) | A method and device for log aggregation | 2023-02-10 | 2023-02-10

Country Status (1)

Country | Link
CN (1) | CN116204801A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN115329748A (en)* | 2022-10-14 | 2022-11-11 | 北京优特捷信息技术有限公司 | Log analysis method, device, equipment and storage medium

Similar Documents

Publication | Publication Date | Title
CN103561133B (en) | | A kind of IP address attribution information index method and method for quickly querying
CN104050247B (en) | | The method for realizing massive video quick-searching
CN108011823B (en) | | Multi-stage method and device for multi-domain flow table and multi-stage flow table searching method and device
CN107967219A (en) | | A kind of extensive character string high-speed searching method based on TCAM
CN106909575B (en) | | Text clustering method and device
CN108959370A (en) | | The community discovery method and device of entity similarity in a kind of knowledge based map
US20230056760A1 (en) | | Method and apparatus for processing graph data, device, storage medium, and program product
WO2013138441A1 (en) | | Systems, methods, and software for computing reachability in large graphs
CN111625617A (en) | | Data indexing method and device and computer readable storage medium
CN115905309A (en) | | Similar entity search method, device, computer equipment and readable storage medium
CN117539925A (en) | | A data processing method, device, medium and equipment
CN100578943C (en) | | An optimized Huffman decoding method and device
CN112131356B (en) | | Message keyword matching method and device based on TCAM
CN108614932A (en) | | Linear flow overlapping community discovery method, system and storage medium based on edge graph
CN109086815B (en) | | Floating point discretization method in decision tree model based on FPGA
CN116204801A (en) | | A method and device for log aggregation
CN108304469A (en) | | Method and apparatus for character string fuzzy matching
Yu et al. | | Scalable forest hashing for fast similarity search
CN117763077A (en) | | Data query method and device
CN115733788B (en) | | OTN network routing search method, system and storage medium based on graph database
CN110046180B (en) | | Method and device for locating similar examples and electronic equipment
CN109670071B (en) | | A serialized multi-feature-guided cross-media hash retrieval method and system
CN103957012B (en) | | A kind of compression method and device of DFA matrixes
CN104008136A (en) | | Method and device for text searching
CN106933844A (en) | | Towards the construction method of the accessibility search index of extensive RDF data
CN106933844A (en)Towards the construction method of the accessibility search index of extensive RDF data

Legal Events

Date | Code | Title | Description
 | PB01 | Publication | Publication
 | SE01 | Entry into force of request for substantive examination | Entry into force of request for substantive examination
