epsilon productions are very useful to express many grammars in a compact way. For example, take these simple function call productions in some imaginary C-like language:
func_call:: identifier '(' arguments_opt ')'arguments_opt:: arguments_list | epsarguments_list:: argument | argument ',' arguments_listWhen composing grammars by hand, simplicity matters. It's very useful to be able to look atarguments_opt and know that it's an optional list of arguments. The same non-terminal can be reused in several other productions.
However, epsilon productions pose a problem for several algorithms that act on grammars. Therefore, prior to running these algorithms, epsilon productions have to be removed. Fortunately, this can be done relatively effortlessly in an automatic way.
Here I want to present an algorithm and a simple implementation for epsilon production removal.
Intuitively, it's quite simple to remove epsilon productions. Consider the grammar for function calls presented above. Theargument_opt nonterminal infunc_call is just a short way of saying that there either is an argument list inside those parens or nothing. In other words, it can be rewritten as follows:
func_call:: identifier '(' arguments_opt ')' | identifier '(' ')'arguments_opt:: arguments_listarguments_list:: argument | argument ',' arguments_listThis duplication of productions forfunc_call will have to be repeated for every other production that hadarguments_opt in it. This grammar looks somewhat strange, asarguments_opt is now identical toarguments_list. It is correct, however.
A more interesting case occurs when the epsilon production is in a nonterminal that appears more than once in some other production[1]. Consider:
B:: A z AA:: a | eps
When we remove the epsilon production fromA, we have to duplicate the productions that haveA in them, but the production forB has twoA. Since either of theA instances in the production can be empty, the only proper way to do this is go over all the combinations:
B:: z | A z | z A | A z AA:: a
In the general case, ifA appearsk times in some production, this production will be replicated2^k times, each time with a different combination[2].
This leads us to the algorithm:
A couple of points to pay attention to:
Here's an implementation of this algorithm in Python:
fromcollectionsimport defaultdictclassCFG(object):def__init__(self):self.prod = defaultdict(list)self.start =Nonedefset_start_symbol(self, start):""" Set the start symbol of the grammar. """self.start = startdefadd_prod(self, lhs, rhs):""" Add production to the grammar. 'rhs' can be several productions separated by '|'. Each production is a sequence of symbols separated by whitespace. Empty strings are interpreted as an eps-production. Usage: grammar.add_prod('NT', 'VP PP') grammar.add_prod('Digit', '1|2|3|4') # Optional Digit: digit or eps grammar.add_prod('Digit_opt', Digit |') """# The internal data-structure representing productions.# maps a nonterminal name to a list of productions, each# a list of symbols. An empty list [] specifies an# eps-production.# prods = rhs.split('|')for prodin prods:self.prod[lhs].append(prod.split())defremove_eps_productions(self):""" Removes epsilon productions from the grammar. The algorithm: 1. Pick a nonterminal p_eps with an epsilon production 2. Remove that epsilon production 3. For each production containing p_eps, replace it with several productions such that all the combinations of p_eps being there or not will be represented. 4. If there are still epsilon productions in the grammar, go back to step 1 The replication can be demonstrated with an example. Suppose that A contains an epsilon production, and we've found a production B:: [A, k, A] Then this production of B will be replaced with these: [A, k], [k], [k, A], [A, k, A] """whileTrue:# Find an epsilon production# p_eps, index =self._find_eps_production()# No epsilon productions? Then we're done...#if p_epsisNone:break# Remove the epsilon production#delself.prod[p_eps][index]# Now find all the productions that contain the# production that removed.# For each such production, replicate it with all# the combinations of the removed production.#for lhsinself.prod: prods = []for lhs_prodinself.prod[lhs]: num_p_eps = lhs_prod.count(p_eps)if num_p_eps ==0: prods.append(lhs_prod)else: prods.extend(self._create_prod_combinations( prod=lhs_prod, nt=p_eps, count=num_p_eps))# Remove duplicates# prods = sorted(prods) prods = [prods[i]for iinxrange(len(prods))if i ==0or prods[i] != prods[i-1]]self.prod[lhs] = prodsdef_find_eps_production(self):""" Finds an epsilon production in the grammar. If such a production is found, returns the pair (lhs, index): the name of the non-terminal that has an epsilon production and its index in lhs's list of productions. If no epsilon productions were found, returns the pair (None, None). Note: eps productions in the start symbol will be ignored, because we don't want to remove them. """for lhsinself.prod:ifnotself.startisNoneand lhs ==self.start:continuefor i, pinenumerate(self.prod[lhs]):iflen(p) ==0:return lhs, ireturnNone,Nonedef_create_prod_combinations(self, prod, nt, count):""" prod: A production (list) that contains at least one instance of 'nt' nt: The non-terminal which should be replicated count: The amount of times 'nt' appears in 'lhs_prod'. Assumed to be >= 1 Returns the generated list of productions. """# The combinations are a kind of a powerset. Membership# in a powerset can be checked by using the binary# representation of a number.# There are 2^count possibilities in total.# numset =1 << count new_prods = []for iinxrange(numset): nth_nt =0 new_prod = []for sin prod:if s == nt:if i & (1 << nth_nt): new_prod.append(s) nth_nt +=1else: new_prod.append(s) new_prods.append(new_prod)return new_prods
And here are the results with some of the sample grammars presented earlier in the article:
cfg = CFG()cfg.add_prod('identifier','( arguments_opt )')cfg.add_prod('arguments_opt','arguments_list | ')cfg.add_prod('arguments_list','argument | argument , arguments_list')cfg.remove_eps_productions()for pin cfg.prod:print p,':: ', [' '.join(pr)for prin cfg.prod[p]]
Produces:
func_call :: ['identifier ( )', 'identifier ( arguments_opt )']arguments_list :: ['argument', 'argument , arguments_list']arguments_opt :: ['arguments_list']
As expected. And:
cfg = CFG()cfg.add_prod('B','A z A')cfg.add_prod('A','a | ')cfg.remove_eps_productions()for pin cfg.prod:print p,':: ', [' '.join(pr)for prin cfg.prod[p]]
Produces:
A :: ['a']B :: ['A z', 'A z A', 'z', 'z A']
The implementation isn't tuned for efficiency, but for simplicity. Luckily, CFGs are usually small enough to make the runtime of this implementation manageable. Note that the preservation of epsilon productions in the start rule is implemented in the_find_eps_production method.

| [1] | From here on, lowercase letters early in the alphabet (a, b, c...) are terminals. Early uppercase letters (A, B, C...) are nonterminals, and letters late in the alphabet (z, y, x...) are arbitrary strings of terminals and nonterminals. |
| [2] | If this sounds like generating apowerset, you're right. |
| [3] | Consider the productions: |
A:: a | epsB:: b | A
After removing the epsilon production fromA we'll have:
A:: aB:: b | A | eps
For comments, please send me an email.