Removing epsilon productions from context free grammars

Background

epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:

func_call:: identifier '(' arguments_opt ')'arguments_opt:: arguments_list | epsarguments_list:: argument | argument ',' arguments_list

When composing grammars by hand, simplicity matters. It's very useful to be able to look atarguments_opt and know that it's an optional list of arguments. The same non-terminal can be reused in several other productions.

However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.

Here I want to present an algorithm and a simple implementation for epsilon production removal.

The algorithm

Intuitively, it's quite simple to remove epsilon productions. Consider the grammar for function calls presented above. Theargument_opt nonterminal infunc_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:

func_call:: identifier '(' arguments_opt ')'          | identifier '(' ')'arguments_opt:: arguments_listarguments_list:: argument | argument ',' arguments_list

This duplication of productions forfunc_call will have to be repeated for every other production that hadarguments_opt in it. This grammar looks somewhat strange, asarguments_opt is now identical toarguments_list. It is correct, however.

A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production[1]. Consider:

B:: A z AA:: a | eps

When we remove the epsilon production fromA, we have to duplicate the productions that haveA in them, but the production forB has twoA. Since either of theA instances in the production can be empty, the only proper way to do this is go over all the combinations:

B:: z | A z | z A | A z AA:: a

In the general case, ifA appearsk times in some production, this production will be replicated2^k times, each time with a different combination[2].

This leads us to the algorithm:

Pick a nonterminalA with an epsilon production
Remove that epsilon production
For each production containingA: Replicate it2^k times wherek is the number ofA instances in the production, such that all combinations ofA being there or not will be represented.
If there are still epsilon productions in the grammar, go back to step 1.

A couple of points to pay attention to:

It's obvious that a step of the algorithm can create new epsilon productions[3]. This is handled correctly, as it works iteratively until all epsilon productions are removed.
The only place where an epsilon production cannot be removed is at the start symbol. If the grammar can generate an empty string, we can't ruin that. A special case will have to handle this case.

Implementation

Here's an implementation of this algorithm in Python:

fromcollectionsimport defaultdictclassCFG(object):def__init__(self):self.prod = defaultdict(list)self.start =Nonedefset_start_symbol(self, start):""" Set the start symbol of the grammar.        """self.start = startdefadd_prod(self, lhs, rhs):""" Add production to the grammar. 'rhs' can            be several productions separated by '|'.            Each production is a sequence of symbols            separated by whitespace.            Empty strings are interpreted as an eps-production.            Usage:                grammar.add_prod('NT', 'VP PP')                grammar.add_prod('Digit', '1|2|3|4')                # Optional Digit: digit or eps                grammar.add_prod('Digit_opt', Digit |')        """# The internal data-structure representing productions.# maps a nonterminal name to a list of productions, each# a list of symbols. An empty list [] specifies an# eps-production.#        prods = rhs.split('|')for prodin prods:self.prod[lhs].append(prod.split())defremove_eps_productions(self):""" Removes epsilon productions from the grammar.            The algorithm:            1. Pick a nonterminal p_eps with an epsilon production            2. Remove that epsilon production            3. For each production containing p_eps, replace it               with several productions such that all the               combinations of p_eps being there or not will be               represented.            4. If there are still epsilon productions in the               grammar, go back to step 1            The replication can be demonstrated with an example.            Suppose that A contains an epsilon production, and            we've found a production B:: [A, k, A]            Then this production of B will be replaced with these:            [A, k], [k], [k, A], [A, k, A]        """whileTrue:# Find an epsilon production#            p_eps, index =self._find_eps_production()# No epsilon productions? Then we're done...#if p_epsisNone:break# Remove the epsilon production#delself.prod[p_eps][index]# Now find all the productions that contain the# production that removed.# For each such production, replicate it with all# the combinations of the removed production.#for lhsinself.prod:                prods = []for lhs_prodinself.prod[lhs]:                    num_p_eps = lhs_prod.count(p_eps)if num_p_eps ==0:                        prods.append(lhs_prod)else:                        prods.extend(self._create_prod_combinations(                            prod=lhs_prod,                            nt=p_eps,                            count=num_p_eps))# Remove duplicates#                prods = sorted(prods)                prods = [prods[i]for iinxrange(len(prods))if i ==0or prods[i] != prods[i-1]]self.prod[lhs] = prodsdef_find_eps_production(self):""" Finds an epsilon production in the grammar. If such            a production is found, returns the pair (lhs, index):            the name of the non-terminal that has an epsilon            production and its index in lhs's list of productions.            If no epsilon productions were found, returns the            pair (None, None).            Note: eps productions in the start symbol will be            ignored, because we don't want to remove them.        """for lhsinself.prod:ifnotself.startisNoneand lhs ==self.start:continuefor i, pinenumerate(self.prod[lhs]):iflen(p) ==0:return lhs, ireturnNone,Nonedef_create_prod_combinations(self, prod, nt, count):""" prod:                A production (list) that contains at least one                instance of 'nt'            nt:                The non-terminal which should be replicated            count:                The amount of times 'nt' appears in 'lhs_prod'.                Assumed to be >= 1            Returns the generated list of productions.        """# The combinations are a kind of a powerset. Membership# in a powerset can be checked by using the binary# representation of a number.# There are 2^count possibilities in total.#        numset =1 << count        new_prods = []for iinxrange(numset):            nth_nt =0            new_prod = []for sin prod:if s == nt:if i & (1 << nth_nt):                        new_prod.append(s)                    nth_nt +=1else:                    new_prod.append(s)            new_prods.append(new_prod)return new_prods

And here are the results with some of the sample grammars presented earlier in the article:

cfg = CFG()cfg.add_prod('identifier','( arguments_opt )')cfg.add_prod('arguments_opt','arguments_list | ')cfg.add_prod('arguments_list','argument | argument , arguments_list')cfg.remove_eps_productions()for pin cfg.prod:print p,':: ', [' '.join(pr)for prin cfg.prod[p]]

Produces:

func_call ::  ['identifier ( )', 'identifier ( arguments_opt )']arguments_list ::  ['argument', 'argument , arguments_list']arguments_opt ::  ['arguments_list']

As expected. And:

cfg = CFG()cfg.add_prod('B','A z A')cfg.add_prod('A','a | ')cfg.remove_eps_productions()for pin cfg.prod:print p,':: ', [' '.join(pr)for prin cfg.prod[p]]

Produces:

A ::  ['a']B ::  ['A z', 'A z A', 'z', 'z A']

The implementation isn't tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the_find_eps_production method.

[1]	From here on, lowercase letters early in the alphabet (a, b, c...) are terminals. Early uppercase letters (A, B, C...) are nonterminals, and letters late in the alphabet (z, y, x...) are arbitrary strings of terminals and nonterminals.

[2]	If this sounds like generating apowerset, you're right.

[3]	Consider the productions:

A:: a | epsB:: b | A

After removing the epsilon production fromA we'll have:

A:: aB:: b | A | eps

For comments, please send me an email.

Movatterモバイル変換

Background

The algorithm

Implementation