Movatterモバイル変換

Thompson's construction

From Wikipedia, the free encyclopedia

Algorithm to transform a regular expression into a finite automaton

Incomputer science,Thompson's constructionalgorithm, also called theMcNaughton–Yamada–Thompson algorithm,^[1] is a method of transforming aregular expression into an equivalentnondeterministic finite automaton (NFA).^[2] This NFA can be used tomatch strings against the regular expression. This algorithm is credited toKen Thompson.

Regular expressions and nondeterministic finite automata are two representations offormal languages. For instance,text processing utilities use regular expressions to describe advanced search patterns, but NFAs are better suited for execution on a computer. Hence, this algorithm is of practical interest, since it cancompile regular expressions into NFAs. From a theoretical point of view, this algorithm is a part of the proof that they both accept exactly the same languages, that is, theregular languages.

An NFA can be made deterministic by thepowerset construction and then beminimized to get an optimal automaton corresponding to the given regular expression. However, an NFA may also beinterpreted directly.

To decide whether two given regular expressions describe the same language, each can be converted into an equivalent minimaldeterministic finite automaton via Thompson's construction,powerset construction, andDFA minimization. If, and only if, the resulting automata agreeup to renaming of states, the regular expressions' languages agree.

The algorithm

[edit]

The algorithm worksrecursively by splitting an expression into its constituent subexpressions, from which the NFA will be constructed using a set of rules.^[3] More precisely, from a regular expressionE, the obtained automatonA with the transition functionΔ^{[clarification needed]} respects the following properties:

A has exactly one initial stateq₀, which is not accessible from any other state. That is, for any stateq and any lettera, $\Delta (q,a)$ does not containq₀.
A has exactly one final stateq_f, which is not co-accessible from any other state. That is, for any lettera, $\Delta (q_{f},a)=\emptyset$ .
Letc be the number of concatenation of the regular expressionE and lets be the number of symbols apart from parentheses — that is,|,*,a andε. Then, the number of states ofA is2s −c (linear in the size ofE).
The number of transitions leaving any state is at most two.
Since an NFA ofm states and at moste transitions from each state can match a string of lengthn in timeO(emn), a Thompson NFA can do pattern matching in linear time, assuming a fixed-size alphabet.^[4]^{[better source needed]}

Rules

[edit]

The following rules are depicted according to Aho et al. (2007),^[1] p. 122. In what follows,N(s) andN(t) are the NFA of the subexpressionss andt, respectively.

Theempty-expression ε is converted to

Asymbola of the input alphabet is converted to

Theunion expressions|t is converted to

Stateq goes via ε either to the initial state ofN(s) orN(t). Their final states become intermediate states of the whole NFA and merge via two ε-transitions into the final state of the NFA.

Theconcatenation expressionst is converted to

The initial state ofN(s) is the initial state of the whole NFA. The final state ofN(s) becomes the initial state ofN(t). The final state ofN(t) is the final state of the whole NFA.

TheKleene star expressions^* is converted to

An ε-transition connects initial and final state of the NFA with the sub-NFAN(s) in between. Another ε-transition from the inner final to the inner initial state ofN(s) allows for repetition of expressions according to the star operator.

Theparenthesized expression (s) is converted toN(s) itself.

With these rules, using theempty expression andsymbol rules as base cases, it is possible to prove withstructural induction that any regular expression may be converted into an equivalent NFA.^[1]

Example

[edit]

Two examples are now given, a small informal one with the result, and a bigger with a step by step application of the algorithm.

Small Example

[edit]

Example of`(ε|a*b)` using Thompson's construction, step by step

The picture below shows the result of Thompson's construction on(ε|a*b). The purple oval corresponds toa, the teal oval corresponds toa*, the green oval corresponds tob, the orange oval corresponds toa*b, and the blue oval corresponds toε.

Application of the algorithm

[edit]

NFA obtained from regular expression`(0|(1(01*(00)*0)*1)*)*`

As an example, the picture shows the result of Thompson's construction algorithm on the regular expression(0|(1(01*(00)*0)*1)*)* that denotes the set of binary numbers that are multiples of 3:

{ ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", ... }.

The upper right part shows the logical structure (syntax tree) of the expression, with "." denoting concatenation (assumed to have variable arity); subexpressions are nameda-q for reference purposes. The left part shows the nondeterministic finite automaton resulting from Thompson's algorithm, with theentry andexit state of each subexpression colored inmagenta andcyan, respectively.An ε as transition label is omitted for clarity — unlabelled transitions are in fact ε transitions.The entry and exit state corresponding to the root expressionq is the start and accept state of the automaton, respectively.

The algorithm's steps are as follows:

q:	start converting Kleene star expression									`(0\|(1(01(00)0)1))*`
	b:	start converting union expression								`0\|(1(01(00)0)1)`
		a:	convert symbol							`0`
		p:	start converting Kleene star expression							`(1(01(00)0)1)`
			d:	start converting concatenation expression						`1(01(00)0)*1`
				c:	convert symbol					`1`
				n:	start converting Kleene star expression					`(01(00)0)*`
					f:	start converting concatenation expression				`01(00)0`
						e:	convert symbol			`0`
						h:	start converting Kleene star expression			`1*`
							g:	convert symbol		`1`
						h:	finished converting Kleene star expression			`1*`
						l:	start converting Kleene star expression			`(00)*`
							j:	start converting concatenation expression		`00`
								i:	convert symbol	`0`
								k:	convert symbol	`0`
							j:	finished converting concatenation expression		`00`
						l:	finished converting Kleene star expression			`(00)*`
						m:	convert symbol			`0`
					f:	finished converting concatenation expression				`01(00)0`
				n:	finished converting Kleene star expression					`(01(00)0)*`
				o:	convert symbol					`1`
			d:	finished converting concatenation expression						`1(01(00)0)*1`
		p:	finished converting Kleene star expression							`(1(01(00)0)1)`
	b:	finished converting union expression								`0\|(1(01(00)0)1)`
q:	finished converting Kleene star expression									`(0\|(1(01(00)0)1))*`

An equivalent minimal deterministic automaton is shown below.

Relation to other algorithms

[edit]

Thompson's is one of several algorithms for constructing NFAs from regular expressions;^[5] an earlier algorithm was given by McNaughton and Yamada.^[6] Converse to Thompson's construction,Kleene's algorithm transforms a finite automaton into a regular expression.

Glushkov's construction algorithm is similar to Thompson's construction, once the ε-transitions are removed.

Use in string pattern matching

[edit]

Regular expressions are often used to specify patterns that software is then asked to match. Generating an NFA by Thompson's construction, and using an appropriate algorithm to simulate it, it is possible to create pattern-matching software with performance that is⁠ $O(mn)$ ⁠, wherem is the length of the regular expression andn is the length of the string being matched. This is much better than is achieved by many popular programming-language implementations;^[7] however, it is restricted to purely regular expressions and does not supportpatterns for non-regular languages like backreferences.

References

[edit]

^^a ^b ^cAlfred Vaino Aho;Monica S. Lam;Ravi Sethi;Jeffrey D. Ullman (2007)."3.7.4 Construction of an NFA from a Regular Expression"(print).Compilers : Principles, Techniques, & Tools (2nd ed.). Boston, MA, USA: Pearson Addison-Wesley. p. 159–163.ISBN 9780321486813.
^Louden, Kenneth C. (1997)."2.4.1 From a Regular Expression to an NFA"(print).Compiler construction : Principles and Practice (3rd ed.). 20 Park Plaza Boston, MA 02116-4324, US: PWS Publishing Company. pp. 64–69.ISBN 978-0-534-93972-4.{{cite book}}: CS1 maint: location (link)
^Ken Thompson (Jun 1968)."Programming Techniques: Regular expression search algorithm".Communications of the ACM.11 (6):419–422.doi:10.1145/363347.363387.S2CID 21260384.
^Xing, Guangming."Minimized Thompson NFA"(PDF).
^Watson, Bruce W. (1995).A taxonomy of finite automata construction algorithms(PDF) (Technical report).Eindhoven University of Technology. Computing Science Report 93/43.
^R. McNaughton, H. Yamada (Mar 1960). "Regular Expressions and State Graphs for Automata".IEEE Transactions on Electronic Computers.9 (1):39–47.Bibcode:1960IRTEC...9...39M.doi:10.1109/TEC.1960.5221603.
^Cox, Russ."Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)".swtchboard. Retrieved25 February 2025.

v t e Strings
String metric	Approximate string matching Bitap algorithm Damerau–Levenshtein distance Edit distance Gestalt pattern matching Hamming distance Jaro–Winkler distance Lee distance Levenshtein automaton Levenshtein distance Wagner–Fischer algorithm
String-searching algorithm	Apostolico–Giancarlo algorithm Boyer–Moore string-search algorithm Boyer–Moore–Horspool algorithm Knuth–Morris–Pratt algorithm Rabin–Karp algorithm Raita algorithm Trigram search Two-way string-matching algorithm Zhu–Takaoka string matching algorithm
Multiple string searching	Aho–Corasick Commentz-Walter algorithm
Regular expression	Comparison of regular-expression engines Regular grammar Thompson's construction Nondeterministic finite automaton
Sequence alignment	BLAST Hirschberg's algorithm Needleman–Wunsch algorithm Smith–Waterman algorithm
Data structure	DAFSA Substring index Suffix array Suffix automaton Suffix tree Compressed suffix array LCP array FM-index Generalized suffix tree Rope Ternary search tree Trie
Other	Parsing Pattern matching Compressed pattern matching Longest common subsequence Longest common substring Sequential pattern mining Sorting String rewriting systems String operations

Retrieved from "https://en.wikipedia.org/w/index.php?title=Thompson%27s_construction&oldid=1334552188"

Category:

Finite-state machines

Hidden categories:

[8]ページ先頭