In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing a Leftmost derivation of the sentence.
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. A grammar is called an LL(k) grammar if an LL(k) parser can be constructed from it. A formal language is called an LL(k) language if it has an LL(k) grammar. The set of LL(k) languages is properly contained in that of LL(k+1) languages, for each k ≥ 0.[1] A corollary of this is that not all context-free languages can be recognized by an LL(k) parser.
An LL parser is called LL-regular (LLR) if it parses an LL-regular language.[2][3][4] The class of LLR grammars contains every LL(k) grammar for every k. For every LLR grammar there exists an LLR parser that parses the grammar in linear time.
Two nomenclative outlier parser types are LL(*) and LL(finite). A parser is called LL(*) or LL(finite) if it uses the LL(*) or LL(finite) parsing strategy, respectively.[5][6] LL(*) and LL(finite) parsers are functionally closer to PEG parsers. An LL(finite) parser can parse an arbitrary LL(k) grammar optimally in the amount of lookahead and lookahead comparisons. The class of grammars parsable by the LL(*) strategy encompasses some context-sensitive languages due to the use of syntactic and semantic predicates and has not been identified. It has been suggested that LL(*) parsers are better thought of as TDPL parsers.[7] Contrary to the popular misconception, LL(*) parsers are not LLR in general, and are guaranteed by construction to perform worse on average (super-linear against linear time) and far worse in the worst case (exponential against linear time).
LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these grammars are easy to construct, and many computer languages are designed to be LL(1) for this reason.[8] LL parsers may be table-based,[9] i.e. similar to LR parsers, but LL grammars can also be parsed by recursive descent parsers. According to Waite and Goos (1984),[10] LL(k) grammars were introduced by Stearns and Lewis (1969).[11]
For a given context-free grammar, the parser attempts to find the leftmost derivation. Given an example grammar G:

1. S → F
2. S → ( S + F )
3. F → a
the leftmost derivation for the input string "( a + a )" is:

S → ( S + F ) → ( F + F ) → ( a + F ) → ( a + a )

(applying rules 2, 1, 3 and 3, in that order).
Generally, there are multiple possibilities when selecting a rule to expand the leftmost non-terminal. In step 2 of the previous example, the parser must choose whether to apply rule 1 or rule 2:

S → ( S + F ) → ( F + F )          (rule 1)
S → ( S + F ) → ( ( S + F ) + F )  (rule 2)
To be efficient, the parser must be able to make this choice deterministically when possible, without backtracking. For some grammars, it can do this by peeking at the unread input (without reading it). In our example, if the parser knows that the next unread symbol is '(', the only correct rule that can be used is 2.
Generally, an LL(k) parser can look ahead at k symbols. However, given a grammar, the problem of determining whether there exists an LL(k) parser for some k that recognizes it is undecidable. For each k, there is a language that cannot be recognized by an LL(k) parser, but can be by an LL(k + 1) parser.
We can use the above analysis to give the following formal definition:
Let G be a context-free grammar and k ≥ 1. We say that G is LL(k) if, for any two leftmost derivations

S ⇒ … ⇒ wAα ⇒ … ⇒ wβα ⇒ … ⇒ wu, and
S ⇒ … ⇒ wAα ⇒ … ⇒ wγα ⇒ … ⇒ wv,

the following condition holds: if the prefix of the string u of length k equals the prefix of the string v of length k, then β = γ.

In this definition, S is the start symbol and A any non-terminal. The already derived input w, and the yet unread u and v, are strings of terminals. The Greek letters α, β and γ represent any string of both terminals and non-terminals (possibly empty). The prefix length corresponds to the lookahead buffer size, and the definition says that this buffer is enough to distinguish between any two derivations of different words.
The LL(k) parser is a deterministic pushdown automaton with the ability to peek at the next k input symbols without reading them. This peek capability can be emulated by storing the lookahead buffer contents in the finite state space, since both buffer and input alphabet are finite in size. As a result, this does not make the automaton more powerful, but is a convenient abstraction.
The stack alphabet is Γ = N ∪ Σ, where:

- N is the set of non-terminals;
- Σ is the set of terminal (input) symbols, with a special end-of-input (EOI) symbol $.
The parser stack initially contains the starting symbol above the EOI: [ S, $ ]. During operation, the parser repeatedly replaces the symbol X on top of the stack:

- by some α, if X ∈ N and there is a rule X → α selected by the table for X and the current lookahead;
- by ε (i.e. X is popped off the stack), if X ∈ Σ. In this case, an input symbol is read and, if it differs from X, the input is rejected.
If the last symbol to be removed from the stack is the EOI, the parsing is successful; the automaton accepts via an empty stack.
The states and the transition function are not explicitly given; they are specified (generated) using a more convenient parse table instead. The table provides the following mapping:

- row: top-of-stack symbol, X
- column: lookahead buffer contents
- cell: rule number for X → α, or ε
If the parser cannot perform a valid transition, the input is rejected (empty cells). To make the table more compact, only the non-terminal rows are commonly displayed, since the action is the same for terminals.
To explain an LL(1) parser's workings we will consider the following small LL(1) grammar:

1. S → F
2. S → ( S + F )
3. F → a
and parse the following input:

( a + a )
An LL(1) parsing table for a grammar has a row for each of the non-terminals and a column for each terminal (including the special terminal, represented here as $, that is used to indicate the end of the input stream).
Each cell of the table may point to at most one rule of the grammar (identified by its number). For example, in the parsing table for the above grammar, the cell for the non-terminal 'S' and terminal '(' points to the rule number 2:
| | ( | ) | a | + | $ |
|---|---|---|---|---|---|
| S | 2 | — | 1 | — | — |
| F | — | — | 3 | — | — |
The algorithm to construct a parsing table is described in a later section, but first let's see how the parser uses the parsing table to process its input.
In each step, the parser reads the next-available symbol from the input stream, and the top-most symbol from the stack. If the input symbol and the stack-top symbol match, the parser discards them both by moving to the next input symbol and popping the top-most symbol off the stack. This is repeated until the input symbol and top-most symbol on the stack do not match.
Thus, in its first step, the parser reads the input symbol '(' and the stack-top symbol 'S'. The parsing table instruction comes from the column headed by the input symbol '(' and the row headed by the stack-top symbol 'S'; this cell contains '2', which instructs the parser to apply rule (2). The parser has to rewrite 'S' to '( S + F )' on the stack: it removes 'S' from the stack, pushes ')', 'F', '+', 'S', '(' onto the stack, and writes the rule number 2 to the output. The stack then becomes:
[ (, S, +, F, ), $ ]
In the second step, the parser removes the '(' from its input stream and from its stack, since they now match. The stack now becomes:
[ S, +, F, ), $ ]
Now the parser has an 'a' on its input stream and an 'S' as its stack top. The parsing table instructs it to apply rule (1) from the grammar and write the rule number 1 to the output stream. The stack becomes:
[ F, +, F, ), $ ]
The parser now has an 'a' on its input stream and an 'F' as its stack top. The parsing table instructs it to apply rule (3) from the grammar and write the rule number 3 to the output stream. The stack becomes:
[ a, +, F, ), $ ]
The parser now has an 'a' on the input stream and an 'a' at its stack top. Because they are the same, it removes it from the input stream and pops it from the top of the stack. The parser then has a '+' on the input stream and '+' at the top of the stack, so, as with 'a', it is popped from the stack and removed from the input stream. This results in:
[ F, ), $ ]
In the next three steps the parser will replace 'F' on the stack by 'a', write the rule number 3 to the output stream and remove the 'a' and ')' from both the stack and the input stream. The parser thus ends with '$' on both its stack and its input stream.
In this case the parser will report that it has accepted the input string and write the following list of rule numbers to the output stream:

[ 2, 1, 3, 3 ]
This is indeed a list of rules for a leftmost derivation of the input string, which is:

S → ( S + F ) → ( F + F ) → ( a + F ) → ( a + a )
Below follows a C++ implementation of a table-based LL parser for the example language:
```cpp
#include <iostream>
#include <map>
#include <stack>

enum Symbols {
    // the symbols:
    // Terminal symbols:
    TS_L_PARENS,    // (
    TS_R_PARENS,    // )
    TS_A,           // a
    TS_PLUS,        // +
    TS_EOS,         // $, in this case corresponds to '\0'
    TS_INVALID,     // invalid token

    // Non-terminal symbols:
    NTS_S,          // S
    NTS_F           // F
};

/*
Converts a valid token to the corresponding terminal symbol
*/
Symbols lexer(char c)
{
    switch (c)
    {
        case '(':  return TS_L_PARENS;
        case ')':  return TS_R_PARENS;
        case 'a':  return TS_A;
        case '+':  return TS_PLUS;
        case '\0': return TS_EOS;     // end of stack: the $ terminal symbol
        default:   return TS_INVALID;
    }
}

int main(int argc, char **argv)
{
    using namespace std;

    if (argc < 2)
    {
        cout << "usage:\n\tll '(a+a)'" << endl;
        return 0;
    }

    // LL parser table, maps <non-terminal, terminal> pair to action
    map<Symbols, map<Symbols, int>> table;
    stack<Symbols> ss;  // symbol stack
    char *p;            // input buffer

    // initialize the symbols stack
    ss.push(TS_EOS);    // terminal, $
    ss.push(NTS_S);     // non-terminal, S

    // initialize the symbol stream cursor
    p = &argv[1][0];

    // set up the parsing table
    table[NTS_S][TS_L_PARENS] = 2;
    table[NTS_S][TS_A] = 1;
    table[NTS_F][TS_A] = 3;

    while (ss.size() > 0)
    {
        if (lexer(*p) == ss.top())
        {
            cout << "Matched symbols: " << lexer(*p) << endl;
            p++;
            ss.pop();
        }
        else
        {
            cout << "Rule " << table[ss.top()][lexer(*p)] << endl;
            switch (table[ss.top()][lexer(*p)])
            {
                case 1: // 1. S → F
                    ss.pop();
                    ss.push(NTS_F);         // F
                    break;

                case 2: // 2. S → ( S + F )
                    ss.pop();
                    ss.push(TS_R_PARENS);   // )
                    ss.push(NTS_F);         // F
                    ss.push(TS_PLUS);       // +
                    ss.push(NTS_S);         // S
                    ss.push(TS_L_PARENS);   // (
                    break;

                case 3: // 3. F → a
                    ss.pop();
                    ss.push(TS_A);          // a
                    break;

                default:
                    cout << "parsing table defaulted" << endl;
                    return 0;
            }
        }
    }

    cout << "finished parsing" << endl;

    return 0;
}
```
A Python implementation of the same table-based parser follows:

```python
from enum import Enum
from collections.abc import Generator


class Term(Enum):
    pass


class Rule(Enum):
    pass


# All constants are indexed from 0
class Terminal(Term):
    LPAR = 0
    RPAR = 1
    A = 2
    PLUS = 3
    END = 4
    INVALID = 5

    def __str__(self):
        return f"T_{self.name}"


class NonTerminal(Rule):
    S = 0
    F = 1

    def __str__(self):
        return f"N_{self.name}"


# Parse table
table = [
    [1, -1, 0, -1, -1, -1],
    [-1, -1, 2, -1, -1, -1],
]

RULES = [
    [NonTerminal.F],
    [
        Terminal.LPAR,
        NonTerminal.S,
        Terminal.PLUS,
        NonTerminal.F,
        Terminal.RPAR,
    ],
    [Terminal.A],
]

stack = [Terminal.END, NonTerminal.S]


def lexical_analysis(input_string: str) -> Generator[Terminal]:
    print("Lexical analysis")
    for c in input_string:
        match c:
            case "a":
                yield Terminal.A
            case "+":
                yield Terminal.PLUS
            case "(":
                yield Terminal.LPAR
            case ")":
                yield Terminal.RPAR
            case _:
                yield Terminal.INVALID
    yield Terminal.END


def syntactic_analysis(tokens: list[Terminal]) -> None:
    print("tokens:", end=" ")
    print(*tokens, sep=", ")
    print("Syntactic analysis")
    position = 0
    while stack:
        svalue = stack.pop()
        token = tokens[position]
        if isinstance(svalue, Term):
            if svalue == token:
                position += 1
                print("pop", svalue)
                if token == Terminal.END:
                    print("input accepted")
            else:
                raise ValueError("bad term on input:", str(token))
        elif isinstance(svalue, Rule):
            print(f"{svalue = !s}, {token = !s}")
            rule = table[svalue.value][token.value]
            print(f"{rule = }")
            for r in reversed(RULES[rule]):
                stack.append(r)
        print("stacks:", end=" ")
        print(*stack, sep=", ")


if __name__ == "__main__":
    inputstring = "(a+a)"
    syntactic_analysis(list(lexical_analysis(inputstring)))
```
Outputs:
```
Lexical analysis
tokens: T_LPAR, T_A, T_PLUS, T_A, T_RPAR, T_END
Syntactic analysis
svalue = N_S, token = T_LPAR
rule = 1
stacks: T_END, T_RPAR, N_F, T_PLUS, N_S, T_LPAR
pop T_LPAR
stacks: T_END, T_RPAR, N_F, T_PLUS, N_S
svalue = N_S, token = T_A
rule = 0
stacks: T_END, T_RPAR, N_F, T_PLUS, N_F
svalue = N_F, token = T_A
rule = 2
stacks: T_END, T_RPAR, N_F, T_PLUS, T_A
pop T_A
stacks: T_END, T_RPAR, N_F, T_PLUS
pop T_PLUS
stacks: T_END, T_RPAR, N_F
svalue = N_F, token = T_A
rule = 2
stacks: T_END, T_RPAR, T_A
pop T_A
stacks: T_END, T_RPAR
pop T_RPAR
stacks: T_END
pop T_END
input accepted
stacks:
```
As can be seen from the example, the parser performs three types of steps depending on whether the top of the stack is a nonterminal, a terminal or the special symbol $:

- If the top is a nonterminal, the parser looks up in the parsing table, on the basis of this nonterminal and the symbol on the input stream, which rule of the grammar to use to replace the nonterminal on the stack. The number of the rule is written to the output stream. If the parsing table indicates that there is no such rule, the parser reports an error and stops.
- If the top is a terminal, the parser compares it to the symbol on the input stream; if they are equal, both are removed. If they are not equal, the parser reports an error and stops.
- If the top is $ and there is also a $ on the input stream, the parser reports that it has successfully parsed the input; otherwise it reports an error. In either case the parser stops.
These steps are repeated until the parser stops, and then it will have either completely parsed the input and written aleftmost derivation to the output stream or it will have reported an error.
In order to fill the parsing table, we have to establish what grammar rule the parser should choose if it sees a nonterminal A on the top of its stack and a symbol a on its input stream. It is easy to see that such a rule should be of the form A → w and that the language corresponding to w should have at least one string starting with a. For this purpose we define the First-set of w, written here as Fi(w), as the set of terminals that can be found at the start of some string in w, plus ε if the empty string also belongs to w. Given a grammar with the rules A1 → w1, ..., An → wn, we can compute the Fi(wi) and Fi(Ai) for every rule as follows:

1. initialize every Fi(Ai) with the empty set;
2. add Fi(wi) to Fi(Ai) for every rule Ai → wi, where Fi is defined as follows:
   - Fi(aw') = { a } for every terminal a,
   - Fi(Aw') = Fi(A) for every nonterminal A with ε not in Fi(A),
   - Fi(Aw') = (Fi(A) \ { ε }) ∪ Fi(w') for every nonterminal A with ε in Fi(A),
   - Fi(ε) = { ε };
3. repeat step 2 until all Fi sets stay the same.
The result is the least fixed point solution to the following system:

- Fi(A) ⊇ Fi(w) for every rule A → w;
- Fi(aw) ⊇ { a } for every terminal a;
- Fi(Aw) ⊇ Fi(A) · Fi(w) for every nonterminal A;
- Fi(ε) ⊇ { ε };

where, for sets of words U and V, the truncated product is defined by U · V = { (uv):1 | u ∈ U, v ∈ V }, and w:1 denotes the initial length-1 prefix of words w of length 2 or more, or w itself, if w has length 0 or 1.
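To make the fixed-point iteration concrete, the following is a minimal Python sketch of the k = 1 First-set computation for the example grammar. The grammar encoding and the helper names (first_of_string, compute_first) are chosen here purely for illustration and are not part of any standard API.

```python
# Minimal sketch of the Fi fixed-point iteration for k = 1.
# The grammar encoding and helper names are illustrative assumptions.
EPSILON = "ε"

# Example grammar:  S → F,  S → ( S + F ),  F → a
GRAMMAR = {
    "S": [["F"], ["(", "S", "+", "F", ")"]],
    "F": [["a"]],
}
NONTERMINALS = set(GRAMMAR)


def first_of_string(symbols, first):
    """Fi(w) for a string w of terminals and nonterminals."""
    result = set()
    for sym in symbols:
        if sym not in NONTERMINALS:      # Fi(aw') = {a} for a terminal a
            result.add(sym)
            return result
        result |= first[sym] - {EPSILON}
        if EPSILON not in first[sym]:    # Fi(Aw') = Fi(A) if ε is not in Fi(A)
            return result
    result.add(EPSILON)                  # every symbol of w can derive ε
    return result


def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                       # repeat until the least fixed point
        changed = False
        for a, alternatives in grammar.items():
            for w in alternatives:       # add Fi(w) to Fi(A) for each A → w
                fi_w = first_of_string(w, first)
                if not fi_w <= first[a]:
                    first[a] |= fi_w
                    changed = True
    return first


print(compute_first(GRAMMAR))  # {'S': {'(', 'a'}, 'F': {'a'}} (set order may vary)
```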
Unfortunately, the First-sets are not sufficient to compute the parsing table. This is because a right-hand side w of a rule might ultimately be rewritten to the empty string. So the parser should also use the rule A → w if ε is in Fi(w) and it sees on the input stream a symbol that could follow A. Therefore, we also need the Follow-set of A, written as Fo(A) here, which is defined as the set of terminals a such that there is a string of symbols αAaβ that can be derived from the start symbol. We use $ as a special terminal indicating the end of the input stream, and S as the start symbol.
Computing the Follow-sets for the nonterminals in a grammar can be done as follows:

1. initialize Fo(S) with { $ } and every other Fo(Ai) with the empty set;
2. if there is a rule of the form Aj → wAiw', then
   - if the terminal a is in Fi(w'), then add a to Fo(Ai),
   - if ε is in Fi(w'), then add Fo(Aj) to Fo(Ai);
3. repeat step 2 until all Fo sets stay the same.
This provides the least fixed point solution to the following system:

- Fo(S) ⊇ { $ };
- Fo(A) ⊇ Fi(w) · Fo(B) for every rule of the form B → ... A w.
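A matching sketch for the Follow-sets, reusing the same illustrative grammar encoding and hard-coding the Fi sets computed above to keep the example self-contained:

```python
# Minimal sketch of the Fo fixed-point iteration for k = 1.
# FIRST is hard-coded from the previous sketch to keep this self-contained.
EPSILON, END = "ε", "$"

GRAMMAR = {
    "S": [["F"], ["(", "S", "+", "F", ")"]],
    "F": [["a"]],
}
FIRST = {"S": {"(", "a"}, "F": {"a"}}
NONTERMINALS = set(GRAMMAR)


def first_of_string(symbols):
    """Fi(w') of the suffix following an occurrence of a nonterminal."""
    result = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            result.add(sym)
            return result
        result |= FIRST[sym] - {EPSILON}
        if EPSILON not in FIRST[sym]:
            return result
    result.add(EPSILON)                   # the whole suffix can vanish
    return result


def compute_follow(grammar, start="S"):
    follow = {a: set() for a in grammar}
    follow[start].add(END)                # Fo(S) starts as { $ }
    changed = True
    while changed:
        changed = False
        for b, alternatives in grammar.items():
            for w in alternatives:
                for i, sym in enumerate(w):
                    if sym not in NONTERMINALS:
                        continue          # only nonterminal occurrences matter
                    fi_rest = first_of_string(w[i + 1:])
                    add = fi_rest - {EPSILON}
                    if EPSILON in fi_rest:
                        add |= follow[b]  # A may end the rule: inherit Fo(B)
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow


print(compute_follow(GRAMMAR))  # {'S': {'+', '$'}, 'F': {'+', ')', '$'}} (order may vary)
```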
Now we can define exactly which rules will appear where in the parsing table. If T[A, a] denotes the entry in the table for nonterminal A and terminal a, then

T[A, a] contains the rule A → w if and only if
- a is in Fi(w), or
- ε is in Fi(w) and a is in Fo(A).
Equivalently: T[A, a] contains the rule A → w for each a ∈ Fi(w) · Fo(A).
If the table contains at most one rule in every one of its cells, then the parser will always know which rule it has to use and can therefore parse strings without backtracking. It is in precisely this case that the grammar is called an LL(1) grammar.
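Putting the pieces together for the example grammar, a minimal sketch of the table construction might look as follows. The rule numbers are those of the example, the FIRST/FOLLOW values are the ones computed above, and a cell that would receive two rules signals that the grammar is not LL(1); the encoding is again purely illustrative.

```python
# Minimal sketch of LL(1) table construction from FIRST/FOLLOW sets.
# Rule numbering matches the example grammar; the sets are hard-coded.
EPSILON = "ε"

RULES = [                                 # (number, left-hand side, right-hand side)
    (1, "S", ["F"]),
    (2, "S", ["(", "S", "+", "F", ")"]),
    (3, "F", ["a"]),
]
FIRST = {"S": {"(", "a"}, "F": {"a"}}
FOLLOW = {"S": {"+", "$"}, "F": {"+", ")", "$"}}
NONTERMINALS = set(FIRST)


def first_of_string(symbols):
    result = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            result.add(sym)
            return result
        result |= FIRST[sym] - {EPSILON}
        if EPSILON not in FIRST[sym]:
            return result
    result.add(EPSILON)
    return result


table = {}
for number, lhs, rhs in RULES:
    fi_w = first_of_string(rhs)
    lookaheads = fi_w - {EPSILON}
    if EPSILON in fi_w:                   # w can vanish: also use Fo(lhs)
        lookaheads |= FOLLOW[lhs]
    for a in lookaheads:
        if (lhs, a) in table:             # two rules in one cell: not LL(1)
            raise ValueError(f"LL(1) conflict in T[{lhs}, {a}]")
        table[(lhs, a)] = number

print(table)  # {('S', 'a'): 1, ('S', '('): 2, ('F', 'a'): 3}
```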
The construction for LL(1) parsers can be adapted to LL(k) for k > 1 with the following modifications:

- the lookahead buffer holds k symbols, and the truncated product is redefined so that w:k denotes the initial length-k prefix of w;
- Fo(S) = { $^k },
where an input is suffixed by k end-markers $, to fully account for the k lookahead context. This approach eliminates special cases for ε, and can be applied equally well in the LL(1) case.
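As an illustration of the length-k truncated product, the following sketch computes length-2 First-sets for the example grammar; the choice k = 2 and the grammar encoding are assumptions made only for this example.

```python
# Sketch of the length-k truncated product and FIRST_k computation, with k = 2.
K = 2
GRAMMAR = {
    "S": [["F"], ["(", "S", "+", "F", ")"]],
    "F": [["a"]],
}
NONTERMINALS = set(GRAMMAR)


def truncated_product(U, V, k=K):
    """U·V = { (uv):k | u in U, v in V }, where w:k is the length-k prefix of w."""
    return {(u + v)[:k] for u in U for v in V}


def first_k(grammar, k=K):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                        # iterate to the least fixed point
        changed = False
        for a, alternatives in grammar.items():
            for w in alternatives:
                fi = {""}                 # Fi(ε) = { ε }, represented as ""
                for sym in w:             # Fi(w) as a truncated product over w
                    sym_first = {sym} if sym not in NONTERMINALS else first[sym]
                    fi = truncated_product(fi, sym_first, k)
                if not fi <= first[a]:
                    first[a] |= fi
                    changed = True
    return first


print(first_k(GRAMMAR))  # {'S': {'a', '(a', '(('}, 'F': {'a'}} (order may vary)
```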
Until the mid-1990s, it was widely believed that LL(k) parsing (for k > 1) was impractical,[12]: 263–265  since the parser table would have exponential size in k in the worst case. This perception changed gradually after the release of the Purdue Compiler Construction Tool Set around 1992, when it was demonstrated that many programming languages can be parsed efficiently by an LL(k) parser without triggering the worst-case behavior of the parser. Moreover, in certain cases LL parsing is feasible even with unlimited lookahead. By contrast, traditional parser generators like yacc use LALR(1) parser tables to construct a restricted LR parser with a fixed one-token lookahead.
As described in the introduction, LL(1) parsers recognize languages that have LL(1) grammars, which are a special case of context-free grammars; LL(1) parsers cannot recognize all context-free languages. The LL(1) languages are a proper subset of the LR(1) languages, which in turn are a proper subset of all context-free languages. In order for a context-free grammar to be an LL(1) grammar, certain conflicts must not arise.
Let A be a non-terminal. FIRST(A) is defined as the set of terminals that can appear in the first position of any string derived from A. FOLLOW(A) is the union over:[13]

1. FIRST(B), where B is any non-terminal that immediately follows A in the right-hand side of a production rule;
2. FOLLOW(B), where B is any head of a rule of the form B → wA.
There are two main types of LL(1) conflicts:
The FIRST sets of two different grammar rules for the same non-terminal intersect. An example of an LL(1) FIRST/FIRST conflict:
S -> E | E 'a'
E -> 'b' | ε
FIRST(E) = {b, ε} and FIRST(E a) = {b, a}, so when the table is drawn, there is a conflict under terminal b of production rule S.
Left recursion will cause a FIRST/FIRST conflict with all alternatives.
E -> E '+' term | alt1 | alt2
The FIRST and FOLLOW set of a grammar rule overlap. With an empty string (ε) in the FIRST set, it is unknown which alternative to select. An example of an LL(1) FIRST/FOLLOW conflict:
S -> A 'a' 'b'
A -> 'a' | ε
The FIRST set of A is {a, ε}, and the FOLLOW set is {a}.
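Tracing the table-construction rule from above on this grammar makes the conflict visible. The snippet below is a small illustrative sketch, with the FIRST sets of the right-hand sides and FOLLOW(A) taken directly from the text:

```python
# Filling the LL(1) table cells for A in the conflicting grammar
#   S → A 'a' 'b',   A → 'a' | ε
# FIRST of each right-hand side and FOLLOW(A) are taken from the text.
FIRST_RHS = {"A → 'a'": {"a"}, "A → ε": {"ε"}}
FOLLOW_A = {"a"}

cells = {}                        # lookahead terminal → competing rules
for rule, fi in FIRST_RHS.items():
    lookaheads = fi - {"ε"}
    if "ε" in fi:                 # the ε-alternative uses FOLLOW(A)
        lookaheads |= FOLLOW_A
    for a in lookaheads:
        cells.setdefault(a, []).append(rule)

print(cells)  # {'a': ["A → 'a'", 'A → ε']} — two rules compete for T[A, a]
```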
One solution is left factoring, in which a common left-factor is "factored out".
A -> X | X Y Z
becomes
A -> X B
B -> Y Z | ε
This can be applied when two alternatives start with the same symbol, as in a FIRST/FIRST conflict.
Another (more complex) example, using the FIRST/FIRST conflict example above:
S -> E | E 'a'
E -> 'b' | ε
becomes (merging into a single non-terminal)
S -> 'b' | ε | 'b' 'a' | 'a'
then through left-factoring, becomes
S -> 'b' E | E
E -> 'a' | ε
Another solution is substitution: a rule is substituted into another rule to remove indirect or FIRST/FOLLOW conflicts. Note that this may cause a FIRST/FIRST conflict.
A third solution is left recursion removal; for a general method, see removing left recursion. As a simple example, the following production rule has left recursion on E:
E -> E '+' T
E -> T
This rule is nothing but a list of Ts separated by '+'. In regular expression form: T ('+' T)*. So the rule could be rewritten as
E -> T Z
Z -> '+' T Z
Z -> ε
Now there is no left recursion and no conflicts on either of the rules.
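Since the rewritten grammar is LL(1), it can also be handled by a simple recursive descent parser. The following minimal sketch assumes, purely for illustration, that T is the single terminal 'a'; it accepts exactly the lists of 'a's separated by '+'.

```python
# Minimal recursive descent sketch for the rewritten grammar
#   E → T Z,   Z → '+' T Z | ε
# with T simplified to the single terminal 'a' (an assumption for illustration).
def parse_E(tokens, pos=0):
    pos = parse_T(tokens, pos)                      # E → T Z
    return parse_Z(tokens, pos)


def parse_Z(tokens, pos):
    if pos < len(tokens) and tokens[pos] == "+":    # Z → '+' T Z
        pos = parse_T(tokens, pos + 1)
        return parse_Z(tokens, pos)
    return pos                                      # Z → ε


def parse_T(tokens, pos):
    if pos < len(tokens) and tokens[pos] == "a":    # T → 'a' (simplified)
        return pos + 1
    raise SyntaxError(f"expected 'a' at position {pos}")


tokens = list("a+a+a")
assert parse_E(tokens) == len(tokens)               # the whole input is consumed
print("accepted")
```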
However, not all context-free grammars have an equivalent LL(k)-grammar, e.g.:
S -> A | B
A -> 'a' A 'b' | ε
B -> 'a' B 'b' 'b' | ε
It can be shown that there does not exist any LL(k)-grammar accepting the language generated by this grammar.