For the past year or so, I've been following the posts on Marpa with interest, but I never got around to writing anything with it, because honestly, the docs seemed a little bit opaque and incomplete to me.
Then, the other day, I saw Jeffrey's post about lexing with Marpa and I took it as a challenge. You see, I've never written a lexer. I've written grammars using "lexer-free" parser builders like Parse::RecDescent and Regexp::Grammars, but when it came to writing anything that required a lexer, I was paralyzed. It seemed to me that lexing was frequently ambiguous, and dealing with that ambiguity was a black art that I couldn't understand.
But I knew that Marpa could handle parsing ambiguities, and the docs told me that ambiguous lexing was tameable too. After a few false starts, I figured out how to work the character-per-earleme model (which is sadly lacking in examples) and started writing the parts I would need to port my TAP reference parser to Marpa.
For those who want to see the code, it's on the marpa branch of TAP::Spec::Parser git, which passes the (admittedly rather small) test suite. But my purpose in this post is to describe the lexing technique, which is pretty simple and, I think, nicely generic.
The token table begins at line 203, and looks like this:
my %tokens = (
    # ...
    'TODO'   => [ qr/\GTODO/i,   'TODO' ],
    'SKIP'   => [ qr/\GSKIP/i,   'SKIP' ],
    'ok'     => [ qr/\Gok/i,     'ok' ],
    'not ok' => [ qr/\Gnot ok/i, 'not ok' ],
    # ...
    'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
    'Number_Of_Tests'  => [ qr/\G(\d+)/,         sub { 0 + $1 } ],
);
Each entry in this table has a key, which corresponds to one of the terminal names in the grammar, a match rule (which is an anchored regex), and optionally a rule for deriving a token value. To match a token, the lexer checks whether its regex matches at the current pos() of the input, and if so, it records the name of the token, the number of characters matched, and the token value (which is executed if it's a coderef, otherwise taken verbatim).
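That match step can be sketched in plain Perl, outside Marpa. The %tokens table below is a trimmed copy of the one above, but the matches_at helper and its layout are my own illustration, not TAP::Spec::Parser's actual code:

```perl
use strict;
use warnings;

# Illustrative token table in the shape described above:
# name => [ anchored match regex, token value (string or coderef) ].
my %tokens = (
    'ok'               => [ qr/\Gok/i,           'ok' ],
    'not ok'           => [ qr/\Gnot ok/i,       'not ok' ],
    'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
);

# Try every token named in @wanted at position $start of $$input;
# return a list of [ name, length, value ] for each one that matched.
sub matches_at {
    my ($input, $start, @wanted) = @_;
    my @found;
    for my $name (@wanted) {
        my ($regex, $value) = @{ $tokens{$name} };
        pos($$input) = $start;               # anchor \G at the lexing position
        next unless $$input =~ /$regex/gc;
        my $len = pos($$input) - $start;     # number of characters matched
        push @found,
            [ $name, $len, ref $value eq 'CODE' ? $value->() : $value ];
    }
    return @found;
}

my $line = "not ok 42 whatever";
my @at0 = matches_at(\$line, 0, keys %tokens);       # only 'not ok' matches
my @at7 = matches_at(\$line, 7, 'Positive_Integer'); # matches "42", value 42
```

Note that pos() is reset before each attempt, since several tokens may need to be tried (and may all match) at the same position.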
To scan and parse an input string, parse_line initializes a Marpa recognizer, sets the lexing position to the beginning of the string, and then repeats the following at each character position:

1. Ask the recognizer which terminals it expects at this point.
2. Try to match each of those tokens at the current position, using the token table.
3. Pass every token that matched to the recognizer, along with its value and its length in characters.
4. Complete the earleme and advance the lexing position by one character.
In steps 1 and 2 it's possible that there are either no expected tokens, or no matched tokens, at the current character position. This is not an error, because it's possible that we're just in the middle of a token, and we need to scan to the end of it (which is accomplished by advancing the lexer and calling complete_earleme).
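The shape of that per-character loop can be shown self-contained, with the Marpa calls reduced to comments; in the real code the expected-token list comes from the recognizer and each match is fed back before the earleme is completed, so everything named here is only a stand-in:

```perl
use strict;
use warnings;

# Same illustrative token table as before.
my %tokens = (
    'ok'               => [ qr/\Gok/i,           'ok' ],
    'not ok'           => [ qr/\Gnot ok/i,       'not ok' ],
    'Positive_Integer' => [ qr/\G([1-9][0-9]*)/, sub { 0 + $1 } ],
);

my $input  = "ok 1";
my @events;    # [ position, token name, length ] for every match found

for my $pos (0 .. length($input) - 1) {
    # 1. ask the recognizer which terminals it expects here
    my @expected = keys %tokens;    # stand-in for terminals_expected
    # 2. try to lex each expected token at this character position
    for my $name (@expected) {
        pos($input) = $pos;
        next unless $input =~ /$tokens{$name}[0]/gc;
        # 3. here the token would be passed to the recognizer,
        #    along with its length in characters
        push @events, [ $pos, $name, pos($input) - $pos ];
    }
    # 4. complete the earleme and move on one character; finding no
    #    token here is fine -- we may simply be inside a longer token
}
```

Running this over "ok 1" records 'ok' at position 0 and Positive_Integer at position 3; the in-between positions correctly yield nothing.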
In step 4, it's possible that Marpa will announce that the parse is "exhausted". This is an error, or at least an exceptional condition, as it means the parse can't succeed with the input it's been given. In TAP::Spec::Parser's line parser we do something a little bit weird, which is to convert any such failure into a "Junk Line" token, but in general it's an opportunity to either put on the ruby slippers or report an error. In either case, the expected tokens from the terminals_expected method come in handy. In addition, you could also add a flag that makes lex attempt to match any token at the current location, instead of only "expected" ones, which would allow generating nice "Expected 'x', got 'y'" type error messages.
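That "match any token" idea might look something like this in plain Perl; build_error, first_match_at, and the token table are hypothetical names, just to show the shape of the message:

```perl
use strict;
use warnings;

# Hypothetical token table; only the match regex matters for error reporting.
my %tokens = (
    'ok'     => [ qr/\Gok/i ],
    'not ok' => [ qr/\Gnot ok/i ],
    'Plan'   => [ qr/\G\d+\.\.\d+/ ],
);

# Return the name of the first token in @names that matches at $pos, if any.
sub first_match_at {
    my ($input, $pos, @names) = @_;
    for my $name (@names) {
        pos($$input) = $pos;
        return $name if $$input =~ /$tokens{$name}[0]/gc;
    }
    return;
}

# When none of the expected tokens match, retry against the whole table
# and name whatever actually matched (or call it junk).
sub build_error {
    my ($input, $pos, @expected) = @_;
    my $got = first_match_at($input, $pos, sort keys %tokens) // 'junk';
    return sprintf "Expected '%s', got '%s'",
        join("' or '", @expected), $got;
}

my $line = "1..5";
my $msg  = build_error(\$line, 0, 'ok', 'not ok');
# "Expected 'ok' or 'not ok', got 'Plan'"
```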
P.S. TAP::Spec::Parser's marpa branch uses two Marpa parsers. At first, this was done to make TAP's "junk line" handling easier, but it's also good for error reporting. Since TAP is a line-oriented protocol, the "line parser" turns a line of input into an object that's used as an input token to the "stream parser", and every EOL is a commit point. Since the "stream" parser expects lines as tokens and doesn't know anything about the insides of lines, out-of-sequence TAP generates errors like "Expected 'Test_Result', found 'Plan'" instead of errors like "Expected 'Status', found '1..'". I think that's a pretty cool side effect.