CN101201836B

Movatterモバイル変換

Info

Publication number: CN101201836B
Application number: CN2007100710710A
Authority: CN
Inventors: 王继民; 平玲娣; 潘雪增; 陈小平; 陈健; 陆魁军
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2007-09-04
Filing date: 2007-09-04
Publication date: 2010-04-14
Anticipated expiration: 2027-09-04
Also published as: CN101201836A

Abstract

本发明公开了一种基于带记忆确定有限自动机的正则表达式匹配加速方法。它包括正则表达式规则编译器和模式匹配引擎，正则表达式规则编译器先把正则表达式转换为解析树，再分别把解析树转换为带记忆的非确定有限自动机和带记忆的确定有限自动机，模式匹配引擎使用编译器生成的带记忆确定有限自动机实现对模式匹配的加速。本发明的优点是：1)因为直接支持了重复操作符，编译器可以不对重复表达式进行展开，大大降低了编译器开发难度，也降低了编译器的内存占用和编译时间；2)基于同样的原因，编译器生成的规则数据库大小也可大大减小，降低了模式匹配引擎的成本和复杂度。The invention discloses a regular expression matching acceleration method based on a deterministic finite automaton with memory. It includes a regular expression rule compiler and a pattern matching engine. The regular expression rule compiler first converts the regular expression into a parse tree, and then converts the parse tree into a non-deterministic finite automaton with memory and a deterministic finite automaton with memory. Automata, the pattern matching engine uses compiler-generated deterministic finite automata with memory to accelerate pattern matching. The advantages of the present invention are: 1) because the repetition operator is directly supported, the compiler may not expand the repetition expression, which greatly reduces the difficulty of compiler development, and also reduces the memory occupation and compilation time of the compiler; 2) based on the same The size of the rule database generated by the compiler can also be greatly reduced, reducing the cost and complexity of the pattern matching engine.

Description

Translated fromChinese

基于带记忆确定有限自动机的正则表达式匹配加速方法An Acceleration Method for Regular Expression Matching Based on Deterministic Finite Automata with Memory

技术领域technical field

本发明涉及信息处理领域，尤其涉及一种基于带记忆确定有限自动机的正则表达式匹配加速方法。The invention relates to the field of information processing, in particular to a regular expression matching acceleration method based on a deterministic finite automaton with memory.

背景技术Background technique

字符串匹配是信息处理领域最基本操作之一，是许多信息处理应用的基础。字符串匹配是在某个输入字符串(以下称为目标字符串)中找出与给定字符串(以下称为特征字符串)具有特定关系的子串的过程。字符串匹配可以分为字符串精确匹配和字符串模糊匹配两种，其中前者是在目标字符串中找出与特征字符串完全相同的子串，而后者则是在目标字符串中找出与特征字符串相似的特定子串(比如，目标字符串的子串比特征字符串增加、减少或修改一个或数个字符)。字符串的精确匹配应用尤为广泛。String matching is one of the most basic operations in the field of information processing and the basis of many information processing applications. String matching is the process of finding substrings in an input string (hereinafter referred to as target string) that have a specific relationship with a given string (hereinafter referred to as feature string). String matching can be divided into string exact matching and string fuzzy matching. Specific substrings that are similar to the feature string (for example, the substring of the target string is increased, decreased, or modified by one or several characters compared with the feature string). Exact matching of strings is especially widely used.

正则表达式是由普通字符(例如字符a到z)和特殊字符(称为元字符)组成的文字模式，该模式可用于描述具有特定特征的一类字符串。比如“.*abc[0-9]$”描述了这样一类字符串：它的头部具有任意个任意字符，并以abc加一个数字结尾。正则表达式被广泛应用于字符串匹配，用来描述特征字符串。例如，在网关部署的内容过滤程序使用基于正则表达式的模式匹配查找并过滤具有不当内容的网络数据；反病毒程序利用基于正则表达式的特征库对病毒进行扫描；网络入侵检测与防御软件利用基于正则表达式的模式匹配对入侵企图进行识别。A regular expression is a literal pattern composed of ordinary characters (such as the characters a to z) and special characters (called metacharacters), which can be used to describe a class of strings with specific characteristics. For example, ".*abc[0-9]$" describes such a type of string: its head has any number of arbitrary characters, and ends with abc plus a number. Regular expressions are widely used in string matching to describe characteristic strings. For example, the content filtering program deployed on the gateway uses regular expression-based pattern matching to find and filter network data with inappropriate content; anti-virus programs use regular expression-based signature databases to scan for viruses; network intrusion detection and prevention software uses Pattern matching based on regular expressions identifies intrusion attempts.

基于正则表达式的模式匹配是计算密集型的任务，需要大量的CPU计算。例如，反病毒程序ClamAV的特征库目前已经具有超过11万条规则，但它需要在尽可能短的时间内判断出给定文件是否具有特征库中的特征。另外一个典型应用是部署在网关的内容过滤程序，它需要实时地对通过的数据流进行检测，并根据内容等级采取相应的动作。目前利用软件进行正则表达式匹配，吞吐率在几百Mbps到1Gbps之间，该性能在实际应用中会下降数倍到数十倍，还远不能满足实时扫描的需要。Pattern matching based on regular expressions is a computationally intensive task that requires a lot of CPU calculations. For example, the signature database of the anti-virus program ClamAV currently has more than 110,000 rules, but it needs to judge whether a given file has the signature in the signature database in the shortest possible time. Another typical application is the content filtering program deployed on the gateway, which needs to detect the passing data stream in real time and take corresponding actions according to the content level. Currently, software is used for regular expression matching, and the throughput rate is between several hundred Mbps and 1 Gbps. This performance will drop several times to dozens of times in practical applications, and it is far from meeting the needs of real-time scanning.

基于硬件加速的正则表达式匹配方法是解决该矛盾的其中一个办法。利用专用集成电路(ASIC)或现场可编程门阵列(FPGA)实现正则表达式匹配引擎，可同样完成模式匹配，但可根据需要进行专门的优化，成倍地提高性能。The hardware-accelerated regular expression matching method is one of the solutions to this contradiction. Using Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) to implement regular expression matching engine can also complete pattern matching, but special optimization can be carried out according to needs to double the performance.

正则表达式的硬件加速通常用自动机实现。Hardware acceleration of regular expressions is usually implemented with automata.

重复操作符“(A){n，m}”是正则表达式中常见的一个操作符，它代表子表达式(A)至少重复n次，至多重复m次。重复操作符有三种形式：The repetition operator "(A) {n, m}" is a common operator in regular expressions, which means that the subexpression (A) repeats at least n times and at most m times. The repetition operator has three forms:

(A){n，m}，表示子表达式(A)至少重复n次，至多重复m次。

(A){n, m} means that the subexpression (A) is repeated at least n times and at most m times.

(A){n，}，表示子表达式(A)至少重复n次。

(A){n,}, means that the subexpression (A) is repeated at least n times.

(A){n}，表示子表达式(A)重复n次。

(A){n}, which means that the subexpression (A) is repeated n times.

在通常的模式匹配软件中，“(A){n，m}”操作符是不被直接支持的，通常会转换为连接操作符和选择操作符，如第一种形式“(A){n，m}”会被转换为：In common pattern matching software, the "(A){n, m}" operator is not directly supported, and is usually converted into a connection operator and a selection operator, such as the first form "(A){n ,m}" will be converted to:

(A)(A)...(A)|(A)(A)...(A)(A)|...|(A)(A)...(A)...(A)n个 (n+1)个 m个(A)(A)...(A)|(A)(A)...(A)(A)|...|(A)(A)...(A)...(A )n (n+1) m

第二、三种形式的重复操作符相对简单，可分别展开为：The second and third forms of repetition operators are relatively simple and can be expanded as:

(A)(A)...(A)·(A)*n个(A)(A)...(A)·(A)*n

以及as well as

(A)(A)..(A)n个(A)(A)..(A)n

上述的转换方法虽然可以对重复操作符进行支持，但是，在重复次数多、范围大的时候，会使正则表达式的状态数急剧增加，造成对规则编译器和模式匹配引擎的巨大负担。比如“a{1，3072}”展开后会得到1+2+...+3072＝4720128个状态！在重复的子表达式复杂时会形成状态爆炸。本发明针对这个问题提出了使用带记忆的状态机对重复操作符进行直接支持。Although the above conversion method can support the repetition operator, when the number of repetitions is large and the range is large, the number of states of the regular expression will increase sharply, causing a huge burden on the rule compiler and pattern matching engine. For example, after "a{1, 3072}" is expanded, 1+2+...+3072=4720128 states will be obtained! State explosions are formed when repeated subexpressions are complex. Aiming at this problem, the present invention proposes to use a state machine with memory to directly support the repetition operator.

确定有限自动机和非确定有限自动机的定义、性能描述，以及从正则表达式构建非确定有限自动机的Thompson算法和从非确定有限自动机构造确定有限自动机的ε-闭包算法的描述，请参考Alfred V.Aho，Ravi Sethi和Jeffrey D.Ullman所著的《编译器：原理、技术与工具》一书；正则表达式的操作符及其意义，请参考Jeffrey E.F.Friedl所著的《掌握正则表达式》一书；软件Lex和Yacc的使用，请参考John Levine，Tony Mason，Doug Brown所著的《Lex与Yacc》一书。The definition and performance description of deterministic finite automata and non-deterministic finite automata, as well as the description of the Thompson algorithm for constructing non-deterministic finite automata from regular expressions and the ε-closure algorithm for constructing deterministic finite automata from non-deterministic finite automata , please refer to the book "Compiler: Principles, Techniques and Tools" by Alfred V.Aho, Ravi Sethi and Jeffrey D.Ullman; for the operators and meanings of regular expressions, please refer to "Compilers" by Jeffrey E.F.Friedl Master Regular Expressions" book; for the use of software Lex and Yacc, please refer to the book "Lex and Yacc" written by John Levine, Tony Mason, and Doug Brown.

发明内容Contents of the invention

本发明的目的是提供一种基于带记忆确定有限自动机的正则表达式匹配加速方法。The purpose of the present invention is to provide a method for accelerating regular expression matching based on finite automata with memory.

基于带记忆确定有限自动机的正则表达式匹配加速方法包括正则表达式规则编译器和模式匹配引擎，正则表达式规则编译器先把正则表达式转换为解析树，再分别把解析树转换为带记忆的非确定有限自动机和带记忆的确定有限自动机，模式匹配引擎使用编译器生成的带记忆确定有限自动机实现对模式匹配的加速。The regular expression matching acceleration method based on deterministic finite automata with memory includes a regular expression rule compiler and a pattern matching engine. The regular expression rule compiler first converts the regular expression into a parse tree, and then converts the parse tree into Nondeterministic finite automata with memory and deterministic finite automata with memory. The pattern matching engine uses deterministic finite automata with memory generated by the compiler to accelerate pattern matching.

其特征在于所述的正则表达式规则编译器先把正则表达式转换为解析树，再分别把解析树转换为带记忆的非确定有限自动机和带记忆的确定有限自动机：借助Lex和Yacc软件实现正则表达式上下文无关文法，对正则表达式规则语法进行解析，并在解析过程中根据匹配文法建立相应相应节点类型的子树，最终形成完整的解析树；在Thompson算法支持操作符的基础上，增加重复操作符({n，m})，其对应的非确定状态机与所重复表达式的非确定状态机相同，但增加了重复范围参数，使用该算法将解析树转换为带记忆的非确定有限自动机；对于不含重复操作符的非确定有限自动机，按ε-闭包算法生成确定有限自动机，而对于含有重复操作符的非确定有限自动机，则先用带有特殊标记的简单字符替换重复操作符，按ε-闭包算法把它转换为确定有限自动机，另外单独把被替换的部分按ε-闭包算法转换为另一个确定有限自动机；解析树由对应的正则表达式经解析生成，为一种每个节点分支不超过2的树，树中非叶节点为操作符，叶子节点为字符或集合，操作符包括连接操作符(“·”)、选择操作符(“|”)、重复操作符(“{n，m}”)、Kleen闭包操作符(“*”)，叶子节点是单个字符、集合、集合的补集、代表集合的字符或代表集合补集的字符，集合与补集的转义字符序列符合IEEE POSIX 1003.2标准。It is characterized in that the regular expression rule compiler first converts the regular expression into a parsing tree, and then converts the parsing tree into a non-deterministic finite automaton with memory and a definite finite automaton with memory: by means of Lex and Yacc The software realizes the regular expression context-free grammar, parses the regular expression rule grammar, and builds the subtree corresponding to the corresponding node type according to the matching grammar during the parsing process, and finally forms a complete parsing tree; the basis of the Thompson algorithm supporting operators Above, add the repetition operator ({n, m}), the corresponding non-deterministic state machine is the same as the non-deterministic state machine of the repeated expression, but add the repetition range parameter, use this algorithm to convert the parse tree into memory The non-deterministic finite automata of ; for non-deterministic finite automata without repetition operators, generate deterministic finite automata according to the ε-closure algorithm, and for non-deterministic finite automata containing repetition operators, first use A special marked simple character replaces the repetition operator, converts it into a definite finite automaton according to the ε-closure algorithm, and separately converts the replaced part into another definite finite automaton according to the ε-closure algorithm; the parse tree consists of The corresponding regular expression is generated by parsing. It is a tree with no more than 2 branches per node. The non-leaf nodes in the tree are operators, and the leaf nodes are characters or sets. Operators include connection operators ("·"), Selection operator ("|"), repetition operator ("{n, m}"), Kleen closure operator ("*"), the leaf node is a single character, a set, the complement of a set, a character representing a set Or a character representing the complement of a set, whose escape character sequences conform to the IEEE POSIX 1003.2 standard.

所述的编译器：编译器以正则表达式规则文件为输入，输出正则表达式数据库，它首先对规则文件进行语法检查并把它转换为带记忆的非确定有限自动机，之后再转换为带记忆的确定有限自动机，带记忆的确定有限自动机存储在规则数据库中。The compiler: the compiler takes the regular expression rule file as input and outputs the regular expression database. It first checks the syntax of the rule file and converts it into a non-deterministic finite automaton with memory, and then converts it into a non-deterministic finite automaton with memory. Deterministic finite automata with memory, Deterministic finite automata with memory are stored in a rule database.

所述的模式匹配引擎使用编译器生成的带记忆确定有限自动机实现对模式匹配的加速：模式匹配引擎读入编译器生成的带记忆确定有限自动机和输入字符串，把输入字符串跟每个有限自动机进行匹配，匹配的位置保存在匹配上下文中；当一个有限自动机匹配完成，则根据该自动机类型确定下一步动作：如果该自动机不包含重复操作符，且没有后续自动机，则报告一个匹配；如果该自动机不包含重复操作符，但有后续有限自动机，则根据贪婪模式选项确定是否报告一个匹配，并生成一个后续有限自动机的匹配上下文；如果该自动机包含重复操作符，则增加匹配次数，并根据匹配次数决定下一步动作：如果匹配次数小于重复范围下限，则把匹配位置调整到自动机初始状态，继续匹配；如果匹配次数大于重复范围上限，则生成后续状态机的匹配上下文或报告一个匹配；如果匹配次数位于重复范围内，则把匹配位置调整到自动机初始状态，继续匹配；模式匹配引擎可以由现场可编程门阵列或专用集成电路实现，它实现了正则表达式数据库中的带记忆确定有限自动机，可接受输入数据，判断是否存在与库中正则表达式的匹配。The pattern matching engine uses the finite automaton with memory generated by the compiler to accelerate the pattern matching: the pattern matching engine reads the finite automaton with memory generated by the compiler and the input string, and compares the input string with each A finite automaton is matched, and the matching position is saved in the matching context; when a finite automaton is matched, the next action is determined according to the type of the automaton: if the automaton does not contain a repetition operator, and there is no subsequent automaton , then a match is reported; if the automaton does not contain a repetition operator, but there is a subsequent finite automaton, then it is determined whether to report a match according to the greedy mode option, and a matching context for the subsequent finite automata is generated; if the automaton contains Repeat the operator, increase the number of matches, and decide the next action based on the number of matches: if the number of matches is less than the lower limit of the repetition range, adjust the matching position to the initial state of the automaton and continue matching; if the number of matches is greater than the upper limit of the repetition range, generate The matching context of the subsequent state machine or report a match; if the number of matches is within the repeated range, adjust the matching position to the initial state of the automaton and continue to match; the pattern matching engine can be implemented by a field programmable gate array or an application-specific integrated circuit. A deterministic finite automaton with memory in the regular expression database is realized, which accepts input data and judges whether there is a match with the regular expression in the database.

本发明的优点是：1)因为直接支持了重复操作符，编译器可以不对重复表达式进行展开，大大降低了编译器开发难度，也降低了编译器的内存占用和编译时间；2)基于同样的原因，编译器生成的规则数据库大小也可大大减小，降低了模式匹配引擎的成本和复杂度。The advantages of the present invention are: 1) because the repetition operator is directly supported, the compiler may not expand the repetition expression, which greatly reduces the difficulty of compiler development, and also reduces the memory occupation and compilation time of the compiler; 2) based on the same The size of the rule database generated by the compiler can also be greatly reduced, reducing the cost and complexity of the pattern matching engine.

附图说明Description of drawings

图1是基本操作符及其对应的解析树；Figure 1 shows the basic operators and their corresponding parse trees;

图2是规则“(a|([0-9])){1，30}c*d”对应的解析树；Figure 2 is the parsing tree corresponding to the rule "(a|([0-9])){1, 30}c*d";

图3是叶子节点(单字符)对应的非确定有限自动机；Fig. 3 is the non-deterministic finite automaton corresponding to leaf node (single character);

图4是叶子节点(集合)对应的非确定有限自动机；Fig. 4 is a non-deterministic finite automaton corresponding to a leaf node (set);

图5是连接操作符对应的非确定有限自动机；Figure 5 is the non-deterministic finite automaton corresponding to the connection operator;

图6是选择操作符对应的非确定有限自动机；Figure 6 is a non-deterministic finite automaton corresponding to the selection operator;

图7是Kleen闭包操作符对应的非确定有限自动机；Figure 7 is the non-deterministic finite automaton corresponding to the Kleen closure operator;

图8是规则“(a|([0-9])){1，30}c*d”对应的按照Thompson算法转换成的非确定有限自动机；Figure 8 is a non-deterministic finite automaton converted according to the Thompson algorithm corresponding to the rule "(a|([0-9])){1, 30}c*d";

图9是规则“(a|([0-9])){1，30}c*d”对应的解析树按改进算法转换成的非确定有限自动机；Figure 9 is a non-deterministic finite automaton converted from the parse tree corresponding to the rule "(a|([0-9])){1, 30}c*d" according to the improved algorithm;

图10是规则“(a|([0-9])){1，30}c*d”对应的非确定有限自动机按照经典ε-闭包算法转换成的确定有限自动机；Figure 10 is a deterministic finite automaton converted from the non-deterministic finite automata corresponding to the rule "(a|([0-9])){1, 30}c*d" according to the classic ε-closure algorithm;

图11是规则“(a|([0-9])){1，30}c*d”对应的非确定有限自动机按改进算法转换成的确定有限自动机；Figure 11 is the deterministic finite automaton converted from the non-deterministic finite automata corresponding to the rule "(a|([0-9])){1, 30}c*d" according to the improved algorithm;

图12是本发明中模式匹配引擎结构图；Fig. 12 is a structural diagram of a pattern matching engine in the present invention;

图13是本发明模式匹配引擎中上下文调度流程图；Fig. 13 is a flow chart of context scheduling in the pattern matching engine of the present invention;

图14是本发明模式匹配引擎的主要执行流程图。Fig. 14 is a main execution flowchart of the pattern matching engine of the present invention.

具体实施方式Detailed ways

基于带记忆确定有限自动机的正则表达式匹配加速方法：其核心是使用带记忆的确定有限自动机对重复操作符(A{n，m})进行直接支持，这样在基本不降低匹配性能的条件下，对于存在大量重复的规则，可大幅降低生成确定有限自动机(DFA)的规模，降低存储代价。同时，也可简化编译器的设计，大幅减少规则处理时间。要使用带记忆的确定有限自动机，规则编译器和模式匹配引擎必须同时对它提供支持。Regular expression matching acceleration method based on deterministic finite automata with memory: its core is to use deterministic finite automata with memory to directly support the repetition operator (A{n, m}), which basically does not reduce the matching performance Under the condition, for rules with a large number of repetitions, the scale of generating deterministic finite automata (DFA) can be greatly reduced, and the storage cost can be reduced. At the same time, it can also simplify the design of the compiler and greatly reduce the rule processing time. To use a deterministic finite automaton with memory, both the rule compiler and the pattern matching engine must support it.

所述的规则描述文件：规则描述文件中包含了需要查找的特征字符串(正则表达式)。规则描述文件中可以有任意条规则，每条规则由以下部分组成：唯一的标识，用于跟其他规则进行区分；规则体，描述了一条正则表达式；规则选项，指定匹配规则时的选项，比如是否忽略字母大小写。The rule description file: the rule description file contains the character strings (regular expressions) to be searched. There can be any rule in the rule description file, and each rule consists of the following parts: a unique identifier, used to distinguish it from other rules; a rule body, describing a regular expression; rule options, specifying options for matching rules, For example, whether to ignore letter case.

规则编译器：规则编译器完成规则到规则数据库的转换。在读取规则描述文件后，对其中的规则进行语法检查，然后，把符合语法定义的规则转换为解析树，并进行预处理和必要的优化；再把解析树转换为带记忆的非确定有限自动机和确定有限自动机，并把所有确定有限自动机写入规则数据库供模式匹配引擎使用。Rule Compiler: The rule compiler completes the transformation of the rules into the rule database. After reading the rule description file, check the syntax of the rules in it, and then convert the rules that meet the syntax definition into a parse tree, and perform preprocessing and necessary optimization; then convert the parse tree into a non-deterministic finite tree with memory automata and deterministic finite automata, and write all deterministic finite automata into the rule database for use by the pattern matching engine.

模式匹配引擎：实现正则表达式的模式匹配。读入编译器生成的规则数据库，以待匹配的数据为输入，匹配完成后输出匹配结果。结果包括以下数据：是否有匹配；如果有匹配，匹配的特征字符串标识、匹配的起始(结束)位置、实际匹配长度。Pattern matching engine: implement pattern matching of regular expressions. Read in the rule database generated by the compiler, take the data to be matched as input, and output the matching result after the matching is completed. The result includes the following data: whether there is a match; if there is a match, the character string identifier of the match, the start (end) position of the match, and the actual match length.

基于带记忆确定有限自动机的正则表达式匹配加速方法的具体步骤如下：The specific steps of the regular expression matching acceleration method based on the deterministic finite automata with memory are as follows:

1)准备正则表达式规则文件。1) Prepare a regular expression rule file.

正则表达式规则文件中包含了正则表达式规则，以及相关的其他信息。每行描述一条正则表达式规则及其相关信息，例如：Regular expression rule files contain regular expression rules and other related information. Each line describes a regular expression rule and its related information, for example:

Id＝1001，Rule＝”abc[0-9]$”，Option＝””Id＝1002，Rule＝”http://[0-9a-z]+\.google\.[a-z]+[\-/％.0-9a-z]*/images\？)(.*)(&？)(safe＝[^&]*”，Option＝”i”Id=1001, Rule=”abc[0-9]$”, Option=””Id=1002, Rule=”http://[0-9a-z]+\.google\.[a-z]+[\ -/%.0-9a-z]*/images\?)(.*)(&?)(safe=[^&]*", Option="i"

上面是一个规则文件的例子，其中有两条规则，每条规则由三个部分构成，第一部分为规则标识符，由Id＝xxx给出；第二部分为规则体，由Rule＝”...”给出；第三部分为相关选项，由Option＝”...”给出。The above is an example of a rule file, in which there are two rules, each rule consists of three parts, the first part is the rule identifier, given by Id=xxx; the second part is the rule body, given by Rule=".. .” is given; the third part is related options, which are given by Option=”…”.

规则体部分描述了特征正则表达式，其语法符合IEEE POSIX 1003.2标准。The rule body part describes the characteristic regular expression, and its syntax conforms to the IEEE POSIX 1003.2 standard.

2)用规则编译器把规则文件编译为规则数据库。2) Use a rule compiler to compile the rule file into a rule database.

规则编译器完成规则文件到规则数据库的转换工作。该工作分四个步骤完成：The rule compiler completes the transformation from the rule file to the rule database. The job is done in four steps:

A.对规则文件进行语法检查，同时生成解析树。A. Check the syntax of the rule file and generate a parse tree at the same time.

正则表达式的语法可以由上下文无关文法描述，因此可以使用Lex和Yacc编译工具生成语法解析器(Parser)，对正则表达式的语法进行检查。例如正则表达式顶层语法用Yacc可表述为：The grammar of a regular expression can be described by a context-free grammar, so Lex and Yacc compiler tools can be used to generate a grammar parser (Parser) to check the grammar of a regular expression. For example, the top-level grammar of regular expressions can be expressed in Yacc as:

在进行语法检查的同时，也可利用Yacc的动作部分定义生成语法解析树(Parse Tree)。解析树为一种每个节点分支不超过2的树，树中非叶节点为操作符，叶子节点为字符或集合。操作符包括连接操作符(“·”)、选择操作符(“|”)、重复操作符(“{n，m}”)、Kleen闭包操作符(“*”)。叶子节点可以是单个字符(如“a”)、集合(如“[a-z]”、“[abc123]”)、集合的补集(如“[^a-z]”、“[^abc123]”)、代表集合的字符(如“\d”，它代表“[0-9]”)或代表集合补集的字符(如“\D”，它代表“[^0-9]”)等。集合与补集的转义字符序列符合IEEE POSIX 1003.2标准。连接操作符(“·”)、选择操作符(“|”)、重复操作符(“{n，m}”)、Kleen闭包操作符(“*”)生成解析树的过程分别如图1所示。While checking the syntax, you can also use the action part definition of Yacc to generate a parse tree (Parse Tree). The parse tree is a tree with no more than 2 branches per node. The non-leaf nodes in the tree are operators, and the leaf nodes are characters or sets. Operators include concatenation operator ("·"), selection operator ("|"), repetition operator ("{n, m}"), Kleen closure operator ("*"). A leaf node can be a single character (such as "a"), a set (such as "[a-z]", "[abc123]"), a complement of a set (such as "[^a-z]", "[^abc123]"), A character representing a set (such as "\d", which represents "[0-9]") or a character representing the complement of a set (such as "\D", which represents "[^0-9]"), etc. Escape character sequences for sets and complements conform to the IEEE POSIX 1003.2 standard. The process of generating the parse tree by connection operator ("·"), selection operator ("|"), repetition operator ("{n, m}"), and Kleen closure operator ("*") is shown in Figure 1 shown.

规则“(a|([0-9])){1，30}c*d”对应的解析树如图2所示。图中“{}”节点代表“{n，m}”重复节点，“[]”节点代表集合节点。具体的内容(重复节点中的n和m以及集合节点中集合的具体元素)由节点相关参数给出，即“{}”节点具有参数(1，30)，“[]”节点具有参数(0，1，2，3，4，5，6，7，8，9)。The parse tree corresponding to the rule "(a|([0-9])){1, 30}c*d" is shown in Figure 2. The "{}" node in the figure represents the "{n, m}" repeated node, and the "[]" node represents the collection node. The specific content (n and m in the repeating node and the specific elements of the collection in the collection node) are given by the node-related parameters, that is, the "{}" node has parameters (1, 30), and the "[]" node has parameters (0 , 1, 2, 3, 4, 5, 6, 7, 8, 9).

B.对解析树进行必要的预处理和优化。B. Perform necessary preprocessing and optimization on the parse tree.

到目前为止，解析树中还包含了一些在后面步骤中不能直接支持的元素，因此必须通过一定的处理使之变成可直接支持的元素。主要的预处理有：So far, the parse tree also contains some elements that cannot be directly supported in the later steps, so certain processing must be done to make them directly supportable elements. The main preprocessing are:

“+”操作符的处理，“+”操作符在后面步骤中不直接支持，需要转换为“*”，如“a+”需转换为“aa*”。该预处理可通过在解析树上增加节点并重构解析树完成。

The processing of the "+" operator. The "+" operator is not directly supported in the following steps and needs to be converted to "*". For example, "a+" needs to be converted to "aa*". This preprocessing can be done by adding nodes to the parse tree and reconstructing the parse tree.

预定义字符集合的处理。通过转义字符定义的字符集合不被直接支持，需要把其中的元素提取出来，明确指定。如“\d”转换为“[0-9]”，“\D”转换为“[^0-9]”“[:alphanum:]”转换为“[0-9a-zA-Z]”。该预处理可直接修改节点参数实现。

Handling of predefined character sets. The character set defined by escape characters is not directly supported, and the elements need to be extracted and specified explicitly. For example, "\d" is converted to "[0-9]", "\D" is converted to "[^0-9]" and "[:alphanum:]" is converted to "[0-9a-zA-Z]". This preprocessing can be realized by modifying the node parameters directly.

顶层选择操作符的分割。如果解析树的根为选择节点“|”则该解析树可分割为两棵解析树而不影响匹配结果。该过程循环执行直到根节点不再是“|”。如“A|B|C”处理后变为三棵解析树，分别为“A”、“B”、“C”。除预处理外，还需进行优化操作。优化操作的目的是在不改变解析树语义的情况下降低表达式的存储或匹配代价。 Splitting of top-level selection operators. If the root of the parse tree is the selection node "|", the parse tree can be split into two parse trees without affecting the matching result. This process loops until the root node is no longer "|". For example, "A|B|C" becomes three parse trees after processing, namely "A", "B", and "C". In addition to preprocessing, optimization operations are also required. The purpose of optimization operations is to reduce the storage or matching cost of expressions without changing the semantics of the parse tree.

C.把解析树转换为带记忆的非确定有限自动机。C. Convert the parse tree to a non-deterministic finite automaton with memory.

把解析树转换为非确定有限自动机，借助改进的Thompson算法。除Thompson算法原来支持的操作符外，另外增加对重复操作符的支持。对解析树进行一趟后续遍历，即可完成非确定有限自动机的生成。Convert parse tree to non-deterministic finite automaton with improved Thompson's algorithm. In addition to the operators originally supported by the Thompson algorithm, additional support for repetition operators is added. A non-deterministic finite automaton can be generated by one subsequent traversal of the parse tree.

对于不同类型节点，其状态机均由子树的状态机推导得出。For different types of nodes, their state machines are derived from the state machines of the subtrees.

◆节点类型为叶子节点(单个字符)时，对应非确定有限自动机如图3所示。◆When the node type is a leaf node (single character), the corresponding non-deterministic finite automaton is shown in Figure 3.

◆节点类型为叶子节点(集合)时，对应非确定有限自动机如图4所示。◆When the node type is a leaf node (set), the corresponding non-deterministic finite automaton is shown in Figure 4.

◆节点类型为连接操作符，假设左右子树对应起始状态分别为I₁、I₂；对应终结状态分别为F₁、F₂，对应非确定有限自动机如图5所示。◆The node type is a connection operator, assuming_that the left and right subtrees correspond to the initial states I₁ and_I₂ respectively;

◆节点类型为选择操作符，假设左右子树对应起始状态分别为I₁、I₂；对应终结状态分别为F₁、F₂，对应非确定有限自动机如图6所示。◆The node type is a selection operator. Assume_that the left and right subtrees correspond to the initial states I₁ and_{I 2}_respectively ;

◆节点类型为Kleen闭包操作符，假设子树对应起始状态为I₁；对应终结状态为F₁，对应非确定有限自动机如图7所示。◆The node type is the Kleen closure operator, assuming that the subtree corresponds to the initial state I₁ ; the corresponding final state is F₁ , and the corresponding non-deterministic finite automaton is shown in Figure 7.

◆节点类型为重复操作符时，生成的非确定有限自动机与子树的非确定有限自动机相同，但需标上特殊标记，并存储重复范围。◆When the node type is a repetition operator, the generated non-deterministic finite automaton is the same as the non-deterministic finite automaton of the subtree, but it needs to be marked with a special mark and store the repeated range.

另外，在遍历过程中，每处理到一个新的重复节点，就生成一个新的非确定有限自动机，并记录相关非确定有限自动机间的连接关系。In addition, during the traversal process, every time a new repeated node is processed, a new non-deterministic finite automaton is generated, and the connection relationship between related non-deterministic finite automata is recorded.

规则“(a|([0-9])){1，30}c*d”对应的解析树按照Thompson算法转换成的非确定有限自动机如图8所示；同一规则按改进算法转换成的非确定有限自动机如图9所示，图中“EP”代表ε边，椭圆形和圆形节点代表状态，带箭头的边表示状态转换，边所附的字符为转换字符。如果箭头为圆头，则转换条件为集合，集合元素由关联参数给出。The non-deterministic finite automaton transformed into the parse tree corresponding to the rule "(a|([0-9])){1, 30}c*d" according to the Thompson algorithm is shown in Figure 8; the same rule is transformed into The non-deterministic finite automaton of , as shown in Figure 9, "EP" in the figure represents ε edge, oval and circular nodes represent states, edges with arrows represent state transitions, and characters attached to edges are transition characters. If the arrow is a round head, the conversion condition is a set, and the set elements are given by the associated parameters.

D.把带记忆的非确定有限自动机转换为带记忆的确定有限自动机。D. Convert non-deterministic finite automata with memory to deterministic finite automata with memory.

把非确定有限自动机转换为确定有限自动机，使用改进的ε-闭包方法。它经典ε-闭包方法不同之处有：对于生成的自动机，增加参数，记录其类型是否需要重复执行；如果需要，还要记录其重复范围。生成的状态机为“满状态机”，即在每个节点都存在对应于0-255输入字符的共256条转换边。如果在非确定有限自动机中不存在某个特定的转换，则在确定有限自动机中增加一条这样的转换，转入的状态为新增的特殊状态——REJECT(拒绝状态，表明匹配失败)。另外，由于前一步骤中，以重复操作符为边界，把解析树分为多个部分，分别转换为非确定有限自动机，并记录了自动机之间的关系，在该步骤中，用ε-闭包方法把每个非确定有限自动机转换为确定有限自动机，同时也许记录自动机之间的连接关系。Convert the non-deterministic finite automata into deterministic finite automata, using the improved ε-closure method. Its difference from the classic ε-closure method is as follows: for the generated automaton, add parameters, record whether its type needs to be repeated; if necessary, record its repeated range. The generated state machine is a "full state machine", that is, there are a total of 256 transition edges corresponding to 0-255 input characters at each node. If a specific transition does not exist in the non-deterministic finite automaton, add such a transition in the deterministic finite automaton, and the state transferred is the newly added special state——REJECT (refusal state, indicating that the matching failed) . In addition, because in the previous step, the parsing tree is divided into multiple parts with the repetition operator as the boundary, and converted into non-deterministic finite automata respectively, and the relationship between the automata is recorded. In this step, ε -The closure method converts each non-deterministic finite automaton into a deterministic finite automaton, while possibly recording the connections between the automata.

规则“(a|([0-9])){1，30}c*d”对应的非确定有限自动机按照经典ε-闭包算法转换成的确定有限自动机如图10所示；同一规则按改进算法转换成的确定有限自动机如图11所示。图中椭圆形和圆形节点代表状态，带箭头的边表示状态转换，边所附的字符为转换字符。The nondeterministic finite automata corresponding to the rule “(a|([0-9])){1, 30}c*d” is converted into a deterministic finite automaton according to the classic ε-closure algorithm as shown in Figure 10; the same Figure 11 shows the deterministic finite automaton converted from the rules according to the improved algorithm. The oval and circular nodes in the figure represent states, the edges with arrows represent state transitions, and the characters attached to the edges are transition characters.

3)使用模式匹配引擎对输入字符串进行匹配3) Use a pattern matching engine to match the input string

模式匹配引擎完成输入字符串与特征字符串(正则表达式生成的状态自动机)之间的匹配。模式匹配引擎可以由现场可编程门阵列(FPGA)或专用集成电路(ASIC)实现，它实现了正则表达式数据库中的带记忆确定有限自动机，可接受输入数据，判断是否存在与库中正则表达式的匹配。模式匹配引擎的结构如图12所示，图中The pattern matching engine completes the matching between the input string and the feature string (state automaton generated by regular expression). The pattern matching engine can be implemented by a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). It implements a finite automaton with memory in the regular expression database, accepts input data, and judges whether there is a regular expression in the database. Expression matches. The structure of the pattern matching engine is shown in Figure 12, in the figure

(1)规则库，规则数据库在模式匹配引擎中的保存形式，其中记录了编译后的确定有限自动机，各自动机之间的连接关系，以及以重复操作符为根的确定有限自动机的重复次数等。(1) Rule base, the storage form of the rule database in the pattern matching engine, which records the compiled definite finite automaton, the connection relationship between each motive, and the repetition of the definite finite automaton rooted at the repetition operator times etc.

(2)执行引擎，模式匹配的执行单元，它负责读入待匹配数据，进行快速分类，创建匹配上下文，以及匹配上下文在各个状态机之间的跳转等。(2) Execution engine, the execution unit of pattern matching, which is responsible for reading in the data to be matched, performing quick classification, creating matching context, and jumping the matching context between various state machines, etc.

(3)快速分类引擎，根据输入数据快速确定相关的确定有限自动机，并由执行引擎创建相应的匹配上下文。(3) Fast classification engine, which quickly determines the relevant definite finite automaton according to the input data, and creates the corresponding matching context by the execution engine.

(4)匹配上下文，记录了当前正在匹配的状态机，当前所处的状态等信息，如果是处于重复操作符为根的状态机，还记录已经匹配的次数。(4) Matching context, which records information such as the state machine currently being matched and the current state it is in. If it is a state machine whose root is a repeat operator, it also records the number of times it has been matched.

(5)规则数据库，规则编译器对规则进行编译后形成的、模式匹配引擎可以读入的二进制文件，其中保存了各个确定有限自动机的状态、各个状态的跳转关系等。如果是重复操作符为根的状态机，还记录重复范围参数。(5) Rule database, which is a binary file formed after the rule compiler compiles the rules and can be read by the pattern matching engine, in which the state of each definite finite automaton and the jump relationship of each state are saved. If it is a state machine whose root is the repetition operator, also record the repetition scope parameter.

(6)输入/输出数据，模式匹配引擎与外界交换的信息。输入数据主要是指待匹配的数据流，也可能包括部分配置模式匹配引擎的数据；输出数据是指模式匹配引擎的匹配结果，包括匹配是否成功，匹配位置、匹配次数等。(6) Input/output data, information exchanged between the pattern matching engine and the outside world. The input data mainly refers to the data stream to be matched, and may also include some data for configuring the pattern matching engine; the output data refers to the matching result of the pattern matching engine, including whether the match is successful, the matching position, the number of matches, etc.

模式匹配的执行流程分别如图13和图14所示。其中图13是上下文调度流程，图14是匹配引擎的主要执行流程。The execution flow of pattern matching is shown in Figure 13 and Figure 14 respectively. Figure 13 is the context scheduling process, and Figure 14 is the main execution process of the matching engine.

Claims

Translated fromChinese

1.一种基于带记忆确定有限自动机的正则表达式匹配加速方法，其特征在于通过正则表达式规则编译器先把正则表达式转换为解析树，再分别把解析树转换为带记忆的非确定有限自动机和带记忆的确定有限自动机，模式匹配引擎使用编译器生成的带记忆确定有限自动机实现对模式匹配的加速，所述通过正则表达式规则编译器先把正则表达式转换为解析树，再分别把解析树转换为带记忆的非确定有限自动机和带记忆的确定有限自动机是：借助Lex和Yacc软件实现正则表达式上下文无关文法，对正则表达式规则语法进行解析，并在解析过程中根据匹配文法建立相应节点类型的子树，最终形成完整的解析树；在Thompson算法支持操作符的基础上，增加重复操作符({n，m})，其对应的非确定状态机与所重复表达式的非确定状态机相同，但增加了重复范围参数，使用该算法将解析树转换为带记忆的非确定有限自动机；对于不含重复操作符的非确定有限自动机，按ε-闭包算法生成确定有限自动机，而对于含有重复操作符的非确定有限自动机，则先用带有特殊标记的简单字符替换重复操作符，按ε-闭包算法把它转换为确定有限自动机，另外单独把被替换的部分按ε-闭包算法转换为另一个确定有限自动机；解析树由对应的正则表达式经解析生成，为一种每个节点分支不超过2的树，树中非叶节点为操作符，操作符包括连接操作符“·”、选择操作符“|”重复操作符“{n，m}”、Kleen闭包操作符“*”，叶子节点是单个字符、集合、集合的补集、代表集合的字符或代表集合补集的字符，集合与补集的转义字符序列符合IEEE POSIX 1003.2标准；所述的模式匹配引擎使用正则表达式规则编译器生成的带记忆确定有限自动机实现对模式匹配的加速是：模式匹配引擎读入正则表达式规则编译器生成的带记忆确定有限自动机和输入字符串，把输入字符串跟每个有限自动机进行匹配，匹配的位置保存在匹配上下文中；当一个有限自动机匹配完成，则根据该有限自动机类型确定下一步动作：如果该有限自动机不包含重复操作符，且没有后续有限自动机，则报告一个匹配；如果该有限自动机不包含重复操作符，但有后续有限自动机，则根据贪婪模式选项确定是否报告一个匹配，并生成一个后续有限自动机的匹配上下文；如果该有限自动机包含重复操作符，则增加匹配次数，并根据匹配次数决定下一步动作：如果匹配次数小于重复范围下限，则把匹配位置调整到有限自动机初始状态，继续匹配；如果匹配次数大于重复范围上限，则生成后续有限自动机的匹配上下文或报告一个匹配；如果匹配次数位于重复范围内，则把匹配位置调整到有限自动机初始状态，继续匹配；模式匹配引擎可以由现场可编程门阵列或专用集成电路实现，它实现了正则表达式数据库中的带记忆确定有限自动机，可接受输入数据，判断是否存在与库中正则表达式的匹配。1. A regular expression matching acceleration method based on band memory deterministic finite automata, it is characterized in that regular expression is converted into parsing tree first by regular expression rule compiler, then parsing tree is converted into the non- Definite finite automata and definite finite automata with memory, the pattern matching engine uses the definite finite automaton with memory generated by the compiler to accelerate the pattern matching, and the regular expression rule compiler first converts the regular expression into The parsing tree, and then converting the parsing tree into a non-deterministic finite automaton with memory and a deterministic finite automaton with memory is: with the help of Lex and Yacc software, the regular expression context-free grammar is realized, and the regular expression rule grammar is analyzed. And in the parsing process, the subtrees of the corresponding node types are established according to the matching grammar, and finally a complete parsing tree is formed; on the basis of the Thompson algorithm supporting operators, the repetition operator ({n, m}) is added, and its corresponding non-deterministic The state machine is the same as the non-deterministic state machine for the repeated expression, but with the addition of the repetition range parameter, using this algorithm to convert the parse tree into a non-deterministic finite automaton with memory; for non-deterministic finite automata without the repetition operator , according to the ε-closure algorithm to generate a deterministic finite automaton, and for a non-deterministic finite automaton containing a repetition operator, first replace the repetition operator with a simple character with a special mark, and convert it according to the ε-closure algorithm In order to determine the finite automaton, the replaced part is separately converted into another definite finite automaton according to the ε-closure algorithm; the parse tree is generated by parsing the corresponding regular expression, and it is a branch of each node with no more than 2 In the tree, the non-leaf nodes in the tree are operators, and the operators include the connection operator "·", the selection operator "|", the repetition operator "{n, m}", the Kleen closure operator "*", and the leaf nodes is a single character, a set, a set's complement, a character representing a set, or a character representing a set's complement, and the escape character sequences of sets and complements conform to the IEEE POSIX 1003.2 standard; the pattern matching engine described is compiled using regular expression rules The definite finite automaton with memory generated by the compiler realizes the acceleration of pattern matching: the pattern matching engine reads the deterministic finite automaton with memory and the input string generated by the regular expression rule compiler, and combines the input string with each finite automatic The matched position is saved in the matching context; when a finite automaton is matched, the next action is determined according to the type of the finite automaton: if the finite automaton does not contain repeated operators, and there is no subsequent finite automaton , then report a match; if the finite automaton does not contain a repetition operator, but has a subsequent finite automaton, then determine whether to report a match according to the greedy mode option, and generate a matching context for the subsequent finite automata; if the finite automaton If the machine contains repetition operators, increase the number of matches, and decide the next action according to the number of matches: if the number of matches is less than the lower limit of the repetition range, adjust the matching position to the initial state of the finite automaton and continue matching; If the number of times is greater than the upper limit of the repetition range, then generate the matching context of the subsequent finite automaton or report a match; if the number of matches is within the repetition range, adjust the matching position to the initial state of the finite automaton and continue matching; the pattern matching engine can be controlled by the field Realized by programming gate array or application-specific integrated circuit, it implements a deterministic finite automaton with memory in the regular expression database, accepts input data, and judges whether there is a match with the regular expression in the library.

2.根据权利要求1所述的一种基于带记忆确定有限自动机的正则表达式匹配加速方法，其特征在于所述的正则表达式规则编译器：正则表达式规则编译器以正则表达式规则文件为输入，输出正则表达式数据库，首先对规则文件进行语法检查并把它转换为带记忆的非确定有限自动机，之后再转换为带记忆的确定有限自动机，带记忆的确定有限自动机存储在正则表达式数据库中。2. a kind of regular expression matching acceleration method based on band memory to determine finite automata according to claim 1, is characterized in that described regular expression rule compiler: regular expression rule compiler uses regular expression rule The file is input, and the regular expression database is output. First, the rule file is checked for syntax and converted to a non-deterministic finite automaton with memory, and then converted to a deterministic finite automaton with memory, and a deterministic finite automaton with memory. Stored in the regular expression database.