Chomsky normal form

From Wikipedia, the free encyclopedia

Notation for context-free formal grammars

Not to be confused with conjunctive normal form.

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form (first described by Noam Chomsky)^[1] if all of its production rules are of the form:^[2]^[3]

A → BC, or

A → a, or

S → ε,

where A, B, and C are nonterminal symbols, the letter a is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), the language produced by the context-free grammar G.^[4]^: 92–93, 106
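
The three rule shapes can be checked mechanically. The following is a minimal sketch, assuming a grammar is stored as a Python dict that maps each nonterminal to a set of right-hand sides (tuples of symbols), and that a symbol counts as a nonterminal exactly when it is a key of that dict. The function name `is_cnf` and this representation are illustrative choices, not part of the definition.

```python
def is_cnf(rules, start):
    """Return True if every rule has the shape A -> BC, A -> a, or start -> ε."""
    for lhs, bodies in rules.items():
        for body in bodies:
            if len(body) == 0:
                # The empty right-hand side is allowed only for the start symbol.
                if lhs != start:
                    return False
            elif len(body) == 1:
                # A single symbol must be a terminal.
                if body[0] in rules:
                    return False
            elif len(body) == 2:
                # Both symbols must be nonterminals other than the start symbol.
                if any(sym not in rules or sym == start for sym in body):
                    return False
            else:
                return False
    return True


good = {"S": {("A", "B"), ()}, "A": {("a",)}, "B": {("b",)}}
bad = {"S": {("A", "S")}, "A": {("a",)}}  # start symbol on a right-hand side
print(is_cnf(good, "S"), is_cnf(bad, "S"))  # True False
```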

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one^[note 1] which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.

Converting a grammar to Chomsky normal form

To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.^[4]^: 87–94^[5]^[6]^[7] The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009).^[8]^[note 2] Each of the following transformations establishes one of the properties required for Chomsky normal form.

START: Eliminate the start symbol from right-hand sides

Introduce a new start symbol S₀, and a new rule

S₀ → S,

where S is the previous start symbol. This does not change the grammar's produced language, and S₀ will not occur on any rule's right-hand side.
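
As a rough sketch of this step, assuming a grammar is stored as a dict from nonterminals to sets of right-hand-side tuples (an illustrative representation, with the hypothetical function name `add_start` and the default name "S0" for the new symbol):

```python
def add_start(rules, start, new_start="S0"):
    """Return (new_rules, new_start) with the extra rule new_start -> start."""
    if new_start in rules:
        # The new start symbol must be fresh, otherwise it could already
        # occur on some right-hand side.
        raise ValueError(f"{new_start!r} is already a nonterminal")
    new_rules = {lhs: set(bodies) for lhs, bodies in rules.items()}
    new_rules[new_start] = {(start,)}
    return new_rules, new_start


rules = {"S": {("a", "S", "b"), ()}}
new_rules, s0 = add_start(rules, "S")
print(sorted(new_rules[s0]))  # [('S',)]
```

Because the new symbol is required to be fresh, it cannot appear on any existing right-hand side, which is the property this step is meant to establish.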

TERM: Eliminate rules with nonsolitary terminals

To eliminate each rule

A → X₁ ... a ... X_n

with a terminal symbol a being not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol N_a, and a new rule

N_a → a.

Change every rule

A → X₁ ... a ... X_n

to

A → X₁ ... N_a ... X_n.

If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This does not change the grammar's produced language.^[4]^: 92
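
A minimal sketch of TERM, under the assumption that a grammar is a dict from nonterminals to sets of right-hand-side tuples and that the generated names of the form "N_a" do not clash with existing nonterminals (both are illustrative assumptions, as is the function name `term`):

```python
def term(rules):
    """Replace terminals in right-hand sides of length >= 2 by wrapper nonterminals."""
    new_rules = {}
    wrappers = {}  # terminal -> name of its wrapper nonterminal N_a
    for lhs, bodies in rules.items():
        new_rules[lhs] = set()
        for body in bodies:
            if len(body) < 2:
                # Solitary terminals (and ε) are left untouched.
                new_rules[lhs].add(body)
                continue
            new_body = []
            for sym in body:
                if sym in rules:  # nonterminal: keep as is
                    new_body.append(sym)
                else:  # terminal: replace by its wrapper
                    new_body.append(wrappers.setdefault(sym, f"N_{sym}"))
            new_rules[lhs].add(tuple(new_body))
    for terminal, name in wrappers.items():
        new_rules[name] = {(terminal,)}
    return new_rules


grammar = {"A": {("a", "B", "b"), ("a",)}, "B": {("b",)}}
result = term(grammar)
print(sorted(result["A"]))  # [('N_a', 'B', 'N_b'), ('a',)]
```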

BIN: Eliminate right-hand sides with more than 2 nonterminals

Replace each rule

A → X₁ X₂ ... X_n

with more than 2 nonterminals X₁, ..., X_n by rules

A → X₁ A₁,

A₁ → X₂ A₂,

... ,

A_{n-2} → X_{n-1} X_n,

where A_i are new nonterminal symbols. Again, this does not change the grammar's produced language.^[4]^: 93
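
The chain construction can be sketched as follows, again assuming the dict-of-sets representation of a grammar. The fresh names of the form "A_1", "A_2" are a hypothetical naming scheme and are assumed not to collide with existing nonterminals.

```python
def binarize(rules):
    """Split every right-hand side longer than 2 into a chain of binary rules."""
    new_rules = {lhs: set() for lhs in rules}
    counter = 0
    for lhs, bodies in rules.items():
        for body in sorted(bodies):  # sorted only to make fresh names deterministic
            head = lhs
            while len(body) > 2:
                counter += 1
                fresh = f"{lhs}_{counter}"  # plays the role of A_i
                new_rules.setdefault(head, set()).add((body[0], fresh))
                head, body = fresh, body[1:]
            new_rules.setdefault(head, set()).add(body)
    return new_rules


chain = binarize({"A": {("W", "X", "Y", "Z")}})
for lhs in sorted(chain):
    print(lhs, "->", sorted(chain[lhs]))
# A -> [('W', 'A_1')]
# A_1 -> [('X', 'A_2')]
# A_2 -> [('Y', 'Z')]
```

A right-hand side of length n yields n − 2 new nonterminals, matching the rule list above.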

DEL: Eliminate ε-rules

An ε-rule is a rule of the form

A → ε,

where A is not S₀, the grammar's start symbol.

To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:

If a rule A → ε exists, then A is nullable.
If a rule A → X₁ ... X_n exists, and every single X_i is nullable, then A is nullable, too.
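
These two rules amount to a fixed-point iteration. A minimal sketch, assuming a grammar is a dict from nonterminals to sets of right-hand-side tuples with the empty tuple standing for ε (the function name `nullable_set` is illustrative), applied to the example grammar used in this section:

```python
def nullable_set(rules):
    """Return the set of nonterminals that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:  # repeat until no new nullable nonterminal is found
        changed = False
        for lhs, bodies in rules.items():
            if lhs in nullable:
                continue
            # all(...) over an empty body is True, which covers A -> ε.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(lhs)
                changed = True
    return nullable


grammar = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
print(sorted(nullable_set(grammar)))  # ['A', 'B']
```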

Obtain an intermediate grammar by replacing each rule

A → X₁ ... X_n

by all versions with some nullable X_i omitted. By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained.^[4]^: 90

For example, in the following grammar, with start symbol S₀,

S₀ → AbB | C

B → AA | AC

C → b | c

A → a | ε

the nonterminal A, and hence also B, is nullable, while neither C nor S₀ is. Hence the following intermediate grammar is obtained, in which omitted nonterminals are struck through:^[note 3]

S₀ → AbB | AbB̶ | A̶bB | A̶bB̶ | C

B → AA | A̶A | AA̶ | A̶εA̶ | AC | A̶C

C → b | c

A → a | ε

In this grammar, all ε-rules have been "inlined at the call site".^[note 4] In the next step, they can hence be deleted, yielding the grammar:

S₀ → AbB | Ab | bB | b | C

B → AA | A | AC | C

C → b | c

A → a

This grammar produces the same language as the original example grammar, viz. {ab, aba, abaa, abab, abac, abb, abc, b, ba, baa, bab, bac, bb, bc, c}, but has no ε-rules.
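
The whole DEL step can be reproduced on this example with a short sketch. It assumes the dict-of-sets grammar representation (empty tuple for ε) and merges the two phases, generating all versions and dropping the ε-rules, into one pass; the function name `delete_epsilon` is illustrative.

```python
from itertools import product


def delete_epsilon(rules, start):
    """Remove ε-rules (except for the start symbol) as described above."""
    # Phase 1: compute the nullable nonterminals by fixed-point iteration.
    nullable, changed = set(), True
    while changed:
        changed = False
        for lhs, bodies in rules.items():
            if lhs not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(lhs)
                changed = True
    # Phase 2: emit every version with some nullable symbols omitted.
    new_rules = {}
    for lhs, bodies in rules.items():
        new_rules[lhs] = set()
        for body in bodies:
            # For each nullable symbol choose keep (True) or omit (False).
            options = [(True, False) if s in nullable else (True,) for s in body]
            for keep in product(*options):
                version = tuple(s for s, k in zip(body, keep) if k)
                if version or lhs == start:  # drop ε unless lhs is the start
                    new_rules[lhs].add(version)
    return new_rules


grammar = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
result = delete_epsilon(grammar, "S0")
print(sorted(result["B"]))  # [('A',), ('A', 'A'), ('A', 'C'), ('C',)]
```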

UNIT: Eliminate unit rules

A unit rule is a rule of the form

A → B,

where A, B are nonterminal symbols. To remove it, for each rule

B → X₁ ... X_n,

where X₁ ... X_n is a string of nonterminals and terminals, add rule

A → X₁ ... X_n

unless this is a unit rule which has already been (or is being) removed. The skipping of nonterminal symbol B in the resulting grammar is possible due to B being a member of the unit closure of nonterminal symbol A.^[9]
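
One way to implement this, sketched below, is to compute the unit closure of each nonterminal first and then copy over all non-unit rules of the closure members, rather than removing unit rules one at a time; the result should be the same. The dict-of-sets grammar representation and the function name `remove_unit_rules` are illustrative assumptions. The input is the ε-free grammar obtained in the DEL example.

```python
def remove_unit_rules(rules):
    """Replace unit rules A -> B by the non-unit rules reachable through them."""
    def is_unit(body):
        return len(body) == 1 and body[0] in rules

    new_rules = {}
    for lhs in rules:
        # Unit closure: nonterminals reachable from lhs via unit rules only.
        closure, stack = {lhs}, [lhs]
        while stack:
            current = stack.pop()
            for body in rules[current]:
                if is_unit(body) and body[0] not in closure:
                    closure.add(body[0])
                    stack.append(body[0])
        new_rules[lhs] = {
            body for member in closure for body in rules[member] if not is_unit(body)
        }
    return new_rules


grammar = {
    "S0": {("A", "b", "B"), ("A", "b"), ("b", "B"), ("b",), ("C",)},
    "B": {("A", "A"), ("A",), ("A", "C"), ("C",)},
    "C": {("b",), ("c",)},
    "A": {("a",)},
}
result = remove_unit_rules(grammar)
print(sorted(result["B"]))  # [('A', 'A'), ('A', 'C'), ('a',), ('b',), ('c',)]
```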

Order of transformations

Mutual preservation of transformation results

Transformation X always preserves (Y) resp. may destroy (N) the result of Y:

Y \ X	START	TERM	BIN	DEL	UNIT
START
TERM
BIN
DEL
UNIT				(Y)*

* UNIT preserves the result of DEL if START had been called before.

When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.

Moreover, the worst-case bloat in grammar size^[note 5] depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|² to 2^{2|G|}, depending on the transformation algorithm used.^[8]^: 7 The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar.^[8]^: 5 The orderings START, TERM, BIN, DEL, UNIT and START, BIN, DEL, UNIT, TERM lead to the least (i.e. quadratic) blow-up.
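
The effect of the order between DEL and BIN can be illustrated by counting rules for a hypothetical worst-case rule A → X₁ ... X_n in which every X_i is a distinct nullable nonterminal. The sketch below counts rules rather than symbols, so it only approximates the size measure used above; the helper names (`versions`, `count_del_first`, `count_bin_first`) and the requirement n ≥ 2 are assumptions of this illustration.

```python
from itertools import product


def versions(body, nullable):
    """All non-empty versions of `body` with some nullable symbols omitted."""
    options = [(True, False) if s in nullable else (True,) for s in body]
    found = set()
    for keep in product(*options):
        version = tuple(s for s, k in zip(body, keep) if k)
        if version:
            found.add(version)
    return found


def count_del_first(n):
    """Rules produced from A -> X1 ... Xn when DEL is applied directly."""
    body = tuple(f"X{i}" for i in range(1, n + 1))
    return len(versions(body, set(body)))


def count_bin_first(n):
    """Rules produced when the rule is binarized before DEL (n >= 2)."""
    xs = [f"X{i}" for i in range(1, n + 1)]
    helpers = [f"A{i}" for i in range(1, n - 1)]  # A1 .. A(n-2) from BIN
    heads = ["A"] + helpers
    # The helpers derive only nullable symbols, so they are nullable as well.
    nullable = set(xs) | set(helpers)
    total = 0
    for i in range(len(heads)):
        last = i == len(heads) - 1
        body = (xs[i], xs[i + 1]) if last else (xs[i], helpers[i])
        total += len(versions(body, nullable))
    return total


print(count_del_first(10), count_bin_first(10))  # 1023 27
```

In this setting DEL alone yields 2^n − 1 rules, whereas binarizing first yields 3(n − 1) rules, some of which are unit rules left for UNIT to remove.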

Example

The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both number and variable are considered terminal symbols here for simplicity, since in a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.

Expr → Term | Expr AddOp Term | AddOp Term
Term → Factor | Term MulOp Factor
Factor → Primary | Factor ^ Primary
Primary → number | variable | ( Expr )
AddOp → + | −
MulOp → * | /

In step "START" of the above conversion algorithm, just a rule S₀ → Expr is added to the grammar. After step "TERM", the grammar looks like this:

S₀ → Expr
Expr → Term | Expr AddOp Term | AddOp Term
Term → Factor | Term MulOp Factor
Factor → Primary | Factor PowOp Primary
Primary → number | variable | Open Expr Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )

After step "BIN", the following grammar is obtained:

S₀ → Expr
Expr → Term | Expr AddOp_Term | AddOp Term
Term → Factor | Term MulOp_Factor
Factor → Primary | Factor PowOp_Primary
Primary → number | variable | Open Expr_Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )
AddOp_Term → AddOp Term
MulOp_Factor → MulOp Factor
PowOp_Primary → PowOp Primary
Expr_Close → Expr Close

Since there are no ε-rules, step "DEL" does not change the grammar. After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:

S₀ → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
Expr → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
Term → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor
Factor → number | variable | Open Expr_Close | Factor PowOp_Primary
Primary → number | variable | Open Expr_Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )
AddOp_Term → AddOp Term
MulOp_Factor → MulOp Factor
PowOp_Primary → PowOp Primary
Expr_Close → Expr Close

The N_a introduced in step "TERM" are PowOp, Open, and Close. The A_i introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close.

Alternative definition

Chomsky reduced form

Another way^[4]^: 92^[10] to define the Chomsky normal form is:

A formal grammar is in Chomsky reduced form if all of its production rules are of the form:

A → BC

A → a

where A, B and C are nonterminal symbols, and a is a terminal symbol. When using this definition, B or C may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.

Floyd normal form

In a letter in which he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied that a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'":

⟨A⟩ ::= ⟨B⟩ | ⟨C⟩

⟨A⟩ ::= ⟨B⟩⟨C⟩

⟨A⟩ ::= a

where ⟨A⟩, ⟨B⟩ and ⟨C⟩ are nonterminal symbols, and a is a terminal symbol, because Robert W. Floyd had found in 1961 that any BNF syntax can be converted to this form.^[11] But he withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note."^[12] While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not.

Application

Besides its theoretical significance, CNF conversion is used in some algorithms as a preprocessing step, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its variant, probabilistic CKY.^[13]
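
To show why the binary rule shape matters, here is a minimal sketch of a CYK-style recognizer. It assumes a grammar already in Chomsky normal form, stored as a dict from nonterminals to sets of right-hand-side tuples; the function name `cyk` and the small grammar for the language aⁿbⁿ (n ≥ 1) are illustrative and not taken from the cited sources.

```python
def cyk(rules, start, word):
    """Return True if `word` (a sequence of terminals) is derivable from `start`."""
    n = len(word)
    if n == 0:
        return () in rules.get(start, set())
    # table[i][l] holds the nonterminals that derive word[i:i+l].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, sym in enumerate(word):
        for lhs, bodies in rules.items():
            if (sym,) in bodies:  # rules of the form A -> a
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, bodies in rules.items():
                    for body in bodies:  # rules of the form A -> BC
                        if len(body) == 2 and body[0] in left and body[1] in right:
                            table[i][length].add(lhs)
    return start in table[0][n]


# CNF grammar for { a^n b^n | n >= 1 }: S -> A T | A B, T -> S B, A -> a, B -> b
anbn = {"S": {("A", "T"), ("A", "B")}, "T": {("S", "B")}, "A": {("a",)}, "B": {("b",)}}
print(cyk(anbn, "S", "aabb"), cyk(anbn, "S", "aab"))  # True False
```

Because every right-hand side has at most two symbols, each table cell only needs to consider a single split point at a time, which is what keeps the table-filling loop simple.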

Notes

[note 1] That is, one that produces the same language.
[note 2] For example, Hopcroft, Ullman (1979) merged TERM and BIN into a single transformation.
[note 3] Indicating a kept and omitted nonterminal N by N and N̶ (struck through), respectively.
[note 4] If the grammar had a rule S₀ → ε, it could not be "inlined", since it had no "call sites". Therefore it could not be deleted in the next step.
[note 5] I.e. written length, measured in symbols.

References

1. Chomsky, Noam (1959). "On Certain Formal Properties of Grammars". Information and Control. 2 (2): 137–167. doi:10.1016/S0019-9958(59)90362-6. Here: Sect. 6, p. 152ff.
2. D'Antoni, Loris. "Page 7, Lecture 9: Bottom-up Parsing Algorithms" (PDF). CS536-S21 Intro to Programming Languages and Compilers. University of Wisconsin-Madison. Archived (PDF) from the original on 2021-07-19.
3. Sipser, Michael (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology. Definition 2.8. ISBN 0-534-95097-3. OCLC 58544333.
4. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages and Computation. Reading, Massachusetts: Addison-Wesley Publishing. ISBN 978-0-201-02988-8.
5. Hopcroft, John E.; Motwani, Rajeev; Ullman, Jeffrey D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Addison-Wesley. ISBN 978-0-321-45536-9. Section 7.1.5, p. 272.
6. Rich, Elaine (2007). "11.8 Normal Forms". Automata, Computability, and Complexity: Theory and Applications (PDF) (1st ed.). Prentice-Hall. p. 169. ISBN 978-0132288064. Archived from the original (PDF) on 2023-01-17.
7. Wegener, Ingo (1993). Theoretische Informatik - Eine algorithmenorientierte Einführung. Leitfäden und Monographien der Informatik (in German). Stuttgart: B. G. Teubner. ISBN 978-3-519-02123-0. Section 6.2 "Die Chomsky-Normalform für kontextfreie Grammatiken", p. 149–152.
8. Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm" (PDF). Informatica Didactica. 8. Archived (PDF) from the original on 2011-07-19.
9. Allison, Charles D. (2022). Foundations of Computing: An Accessible Introduction to Automata and Formal Languages. Fresh Sources, Inc. p. 176. ISBN 9780578944173.
10. Hopcroft et al. (2006) [page needed]
11. Floyd, Robert W. (1961). "Note on mathematical induction in phrase structure grammars" (PDF). Information and Control. 4 (4): 353–358. doi:10.1016/S0019-9958(61)80052-1. Archived (PDF) from the original on 2021-03-05. Here: p. 354.
12. Knuth, Donald E. (December 1964). "Backus Normal Form vs. Backus Naur Form". Communications of the ACM. 7 (12): 735–736. doi:10.1145/355588.365140. S2CID 47537431.
13. Jurafsky, Daniel; Martin, James H. (2008). Speech and Language Processing (2nd ed.). Pearson Prentice Hall. p. 465. ISBN 978-0-13-187321-6.
