Lex machinery for Go.

timtadh/lexmachine

By Tim Henderson

Copyright 2014-2017, All Rights Reserved. Made available for public use under the terms of a BSD 3-Clause license.

What?

lexmachine is a full lexical analysis framework for the Go programming language. It supports a restricted but usable set of regular expressions appropriate for writing lexers for complex programming languages. The framework also supports sub-lexers and non-regular lexing through an "escape hatch" which allows users to consume any number of further bytes after a match. So if you want to support nested C-style comments or other paired structures you can do so at the lexical analysis stage.

Subscribe to the mailing list to get announcements of major changes, new versions, and important patches.

Goal

lexmachine intends to be the best, fastest, and easiest-to-use lexical analysis system for Go.

  1. Documentation Links
  2. Narrative Documentation
  3. Regular Expressions in lexmachine
  4. History
  5. Complete Example

Documentation

What is in the Box

lexmachine includes the following components:

  1. A parser for a restricted set of regular expressions.
  2. An abstract syntax tree (AST) for regular expressions.
  3. A backpatching code generator which compiles the AST to (NFA) machine code.
  4. Both DFA (Deterministic Finite Automata) and NFA (Non-deterministic Finite Automata) simulation based lexical analysis engines. Lexical analysis engines work in a slightly different way from a normal regular expression engine as they tokenize a stream rather than matching one string.
  5. Match objects which include the start and end column and line numbers of the lexemes as well as their associated token name.
  6. A declarative "DSL" for specifying the lexers.
  7. An "escape hatch" which allows one to match non-regular tokens by consuming any number of further bytes after the match.

Narrative Documentation

lexmachine splits strings into substrings and categorizes each substring. In compiler design, the substrings are referred to as lexemes and the categories are referred to as token types or just tokens. The categories are defined by patterns which are specified using regular expressions. The process of splitting up a string is sometimes called tokenization, lexical analysis, or lexing.

Defining a Lexer

The set of patterns (regular expressions) used to tokenize (split up and categorize) is called a lexer. Lexers are first-class objects in lexmachine. They can be defined once and re-used over and over again to tokenize multiple strings. After the lexer has been defined it will be compiled (either explicitly or implicitly) into either a Non-deterministic Finite Automaton (NFA) or a Deterministic Finite Automaton (DFA). The automaton is then used (and re-used) to tokenize strings.

Creating a new Lexer

lexer := lexmachine.NewLexer()

Adding a pattern

Let's pretend we want a lexer which only recognizes one category: strings which match the word "wild" capitalized or not (e.g. Wild, wild, WILD, ...). That expression is denoted: [Ww][Ii][Ll][Dd]. Patterns are added using the Add function:

lexer.Add(
    []byte(`[Ww][Ii][Ll][Dd]`),
    func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return 0, nil
    },
)

Add takes two arguments: the pattern and a callback function called a lexing action. The action allows you, the programmer, to transform the low-level machines.Match object (from github.com/timtadh/lexmachine/machines) into an object meaningful for your program. As an example, let's define a few token types and a token object. Then we will construct appropriate action functions.

Tokens := []string{
    "WILD",
    "SPACE",
    "BANG",
}
TokenIds := make(map[string]int)
for i, tok := range Tokens {
    TokenIds[tok] = i
}

Now that we have defined a set of three tokens (WILD, SPACE, BANG), let's create a token object:

type Token struct {
    TokenType int
    Lexeme    string
    Match     *machines.Match
}

Now let's make a helper function which takes a Match and a token type and creates a Token.

func NewToken(tokenType string, m *machines.Match) *Token {
    return &Token{
        TokenType: TokenIds[tokenType], // defined above
        Lexeme:    string(m.Bytes),
        Match:     m,
    }
}

Now we write an action for the previous pattern:

lexer.Add(
    []byte(`[Ww][Ii][Ll][Dd]`),
    func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return NewToken("WILD", m), nil
    },
)

Writing the action functions can get tedious, so a good idea is to create a helper function which produces these action functions:

func token(tokenType string) func(*lexmachine.Scanner, *machines.Match) (interface{}, error) {
    return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return NewToken(tokenType, m), nil
    }
}

Then adding patterns for our 3 tokens is concise:

lexer.Add([]byte(`[Ww][Ii][Ll][Dd]`), token("WILD"))
lexer.Add([]byte(` `), token("SPACE"))
lexer.Add([]byte(`!`), token("BANG"))

Built-in Token Type

Many programs use similar representations for tokens. lexmachine provides a completely optional Token object you can use in lieu of writing your own.

type Token struct {
    Type        int
    Value       interface{}
    Lexeme      []byte
    TC          int
    StartLine   int
    StartColumn int
    EndLine     int
    EndColumn   int
}

Here is an example of constructing a lexer Action which turns a machines.Match struct into a token using the scanner's Token helper function.

func token(name string, tokenIds map[string]int) lex.Action {
    return func(s *lex.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(tokenIds[name], string(m.Bytes), m), nil
    }
}

Adding Multiple Patterns

When constructing a lexer for a complex computer language, tokens often have patterns which overlap -- multiple patterns could match the same strings. To address this problem, lexical analysis engines follow two rules when choosing which pattern to match:

  1. Pick the pattern which matches the longest prefix of unmatched text.
  2. Break ties by picking the pattern which appears earlier in the user-supplied list.

For example, let's pretend we are writing a lexer for Python. Python has a bunch of keywords in it such as class and def. However, it also has identifiers which match the pattern [A-Za-z_][A-Za-z0-9_]*. That pattern also matches the keywords. If we were to define the lexer as:

lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))

Then the keywords class and def would never be found because the ID token would take precedence. The correct way to solve this problem is by putting the keywords first:

lexer.Add([]byte(`class`), token("CLASS"))
lexer.Add([]byte(`def`), token("DEF"))
lexer.Add([]byte(`[A-Za-z_][A-Za-z0-9_]*`), token("ID"))
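
Rule 1 (the longest-match rule) does not depend on ordering at all. Here is a minimal sketch, reusing the token helper from above with hypothetical ASSIGN and EQ token names that are not defined elsewhere in this document:

lexer.Add([]byte(`=`), token("ASSIGN")) // hypothetical single-character token
lexer.Add([]byte(`==`), token("EQ"))    // hypothetical two-character token

On the input ==, both patterns can begin a match, but == matches a longer prefix, so an EQ token is produced even though the ASSIGN pattern was added first; the ordering rule only breaks ties between patterns that match prefixes of the same length.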

Skipping Patterns

Sometimes it is advantageous to not emit tokens for certain patterns and to instead skip them. Commonly this occurs for whitespace and comments. To skip a pattern simply have the action return nil, nil:

lexer.Add([]byte("( |\t|\n)"),func(scan*Scanner,match*machines.Match) (interface{},error) {// skip white spacereturnnil,nil},)lexer.Add([]byte("//[^\n]*\n"),func(scan*Scanner,match*machines.Match) (interface{},error) {// skip white spacereturnnil,nil},)

Compiling the Lexer

lexmachine uses the theory of finite state machines to efficiently tokenize text. So what is a finite state machine? A finite state machine is a mathematical construct which is made up of a set of states, with a labeled starting state, and accepting states. There is a transition function which moves from one state to another state based on an input character. In general, in lexing there are two usual types of state machines used: Non-deterministic and Deterministic.

Before a lexer (like the ones described above) can be used it must be compiled into either a Non-deterministic Finite Automaton (NFA) or a Deterministic Finite Automaton (DFA). The difference between the two (from a practical perspective) is construction time and match efficiency.

Construction time is the amount of time it takes to turn a set of regular expressions into a state machine (also called a finite state automaton). For an NFA it is O(r), where r is the length of the regular expression. However, for a DFA it could be as bad as O(2^r), though in practical terms it is rarely worse than O(r^3). The DFAs in lexmachine are also automatically minimized, which reduces the amount of memory they consume and takes O(r*log(log(r))) steps.

However, construction time is an upfront cost. If your program is tokenizing multiple strings it is less important than match efficiency. Let's say a string has length n. An NFA can tokenize such a string in O(n*r) steps while a DFA can tokenize the string in O(n). For larger languages r becomes a significant overhead.

By default, lexmachine uses a DFA. To explicitly invoke compilation call Compile:

err := lexer.Compile()
if err != nil {
    // handle err
}

To explicitly compile a DFA (in case of changes to the default behavior of Compile):

err := lexer.CompileDFA()
if err != nil {
    // handle err
}

To explicitly compile an NFA:

err := lexer.CompileNFA()
if err != nil {
    // handle err
}

Tokenizing a String

To tokenize (lex) a string construct a Scanner object using the lexer. This will compile the lexer if it has not already been compiled.

scanner, err := lexer.Scanner([]byte("some text to lex"))
if err != nil {
    // handle err
}

The scanner object is an iterator which yields the next token (or error) by calling the Next() method:

for tok, err, eos := scanner.Next(); !eos; tok, err, eos = scanner.Next() {
    if ui, is := err.(*machines.UnconsumedInput); is {
        // skip the error via:
        // scanner.TC = ui.FailTC
        //
        _ = ui // ui is only referenced in the comment above
        return err
    } else if err != nil {
        return err
    }
    fmt.Println(tok)
}

Let's break down that first line:

for tok, err, eos := scanner.Next();

The Next() method returns three things: the token (tok) if there is one, an error (err) if there is one, and eos, a boolean which indicates whether the End Of String (EOS) has been reached.

; !eos;

Iteration proceeds until the EOS has been reached.

; tok, err, eos = scanner.Next() {

The update block calls Next() again to get the next token. In each iteration of the loop the first thing a client must do is check for an error.

if err != nil {
    return err
}

This prevents an infinite loop on an unexpected character or other bad token. To skip bad tokens check whether the err is a *machines.UnconsumedInput object and reset the scanner's text counter (scanner.TC) to point to the end of the failed token.

if ui, is := err.(*machines.UnconsumedInput); is {
    scanner.TC = ui.FailTC
    continue
}

Finally, a client can make use of the token produced by the scanner (if there was no error):

fmt.Println(tok)

Dealing with Non-regular Tokens

lexmachine, like most lexical analysis frameworks, primarily deals with patterns which are represented by regular expressions. However, sometimes a language has a token which is "non-regular." A pattern is non-regular if there is no regular expression (or finite automaton) which can express the pattern. For instance, if you wanted to define a pattern which matches only consecutive balanced parentheses: (), ()()(), ((()()))()(), ... you would quickly find there is no regular expression which can express this language. The reason is simple: finite automata cannot "count" or keep track of how many opening parentheses they have seen.

This problem arises in many programming languages when dealing with nested "c-style" comments. Supporting the nesting means solving the "balanced parenthesis" problem. Luckily, lexmachine provides an "escape-hatch" to deal with these situations in the Action functions. All actions receive a pointer to the Scanner. The scanner (as discussed above) has a public modifiable field called TC which stands for text counter. Any action can modify the text counter to point at the desired position it would like the scanner to resume scanning from.

An example of using this feature for tokenizing "c-style" comments is below:

lexer.Add([]byte("/\\*"),func(scan*Scanner,match*machines.Match) (interface{},error) {fortc:=scan.TC;tc<len(scan.Text);tc++ {ifscan.Text[tc]=='\\' {// the next character is skippedtc++}elseifscan.Text[tc]=='*'&&tc+1<len(scan.Text) {ifscan.Text[tc+1]=='/' {// set the text counter to point to after the// end of the comment. This will cause the// scanner to resume after the comment instead// of picking up in the middle.scan.TC=tc+2// don't return a token to skip the commentreturnnil,nil}}}returnnil,fmt.Errorf("unclosed comment starting at %d, (%d, %d)",match.TC,match.StartLine,match.StartColumn)},)

Regular Expressions

Lexmachine (like most lexical analysis frameworks) uses Regular Expressions to specify the patterns to match when splitting the string up into categorized tokens. For a more advanced introduction to regular expression engines see Russ Cox's articles. To learn more about how regular expressions are used to tokenize strings take a look at Alex Aiken's video lectures on the subject. Finally, Aho et al. give a thorough treatment of the subject in the Dragon Book, Chapter 3.

A regular expression is a pattern which matches a set of strings. It is made up of characters such as a or b, characters with special meanings (such as . which matches any character), and operators. The regular expression abc matches exactly one string: abc.

Character Expressions

In lexmachine most characters (e.g. a, b, or #) represent themselves. Some have special meanings (as detailed below in operators). However, any character can be matched literally by prefixing it with a \.

Any Character

. matches any character.

Special Characters

  1. \ : use \\ to match
  2. newline: use \n to match
  3. carriage return: use \r to match
  4. tab: use \t to match
  5. . : use \. to match
  6. operators {|, +, *, ?, (, ), [, ], ^}: prefix with a \ to match (a short example follows this list).
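
For example, to emit tokens for the literal operator characters themselves, the patterns must escape them. A minimal sketch, assuming the token helper from the narrative documentation above and hypothetical PLUS, STAR, LPAREN, and DOT token names:

lexer.Add([]byte(`\+`), token("PLUS"))   // the literal +
lexer.Add([]byte(`\*`), token("STAR"))   // the literal *
lexer.Add([]byte(`\(`), token("LPAREN")) // the literal (
lexer.Add([]byte(`\.`), token("DOT"))    // the literal . rather than "any character"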

Character Classes

Sometimes it is advantageous to match a variety of characters. For instance, if you want to ignore capitalization for the word Capitol you could write the expression [Cc]apitol which would match both Capitol and capitol. There are two forms of character ranges:

  1. [abcd] matches all the letters inside the [] (e.g. that pattern matches the strings a, b, c, d).
  2. [a-d] matches the range of characters between the character before the dash (a) and the character after the dash (d) (e.g. that pattern matches the strings a, b, c, d).

These two forms may be combined:

For instance, [a-zA-Z123] matches the strings {a, b, ..., z, A, B, ..., Z, 1, 2, 3}.
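
As a sketch of how character classes appear in a lexer definition (reusing the token helper from above, with hypothetical CAPITOL and HEX token names):

lexer.Add([]byte(`[Cc]apitol`), token("CAPITOL")) // matches Capitol or capitol
lexer.Add([]byte(`[0-9a-fA-F]+`), token("HEX"))   // matches one or more hexadecimal digits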

Inverted Character Classes

Sometimes it is easier to specify the characters you don't want to match than the characters you do. For instance, say you want to match any character but a lower case one. This can be achieved using an inverted class: [^a-z]. An inverted class is specified by putting a ^ just after the opening bracket.
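
A common use of an inverted class is matching a simple double-quoted string that contains no quote characters. A minimal sketch (hypothetical STRING token name; escape sequences are not handled here):

lexer.Add([]byte(`"[^"]*"`), token("STRING")) // a quote, any non-quote characters, a closing quote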

Built-in Character Classes

  1. \d = [0-9] (the digit class)
  2. \D = [^0-9] (the not-a-digit class)
  3. \s = [ \t\n\r\f] (the space class), where \f is a form feed (note: \f is not a special sequence in lexmachine; if you want to specify the form feed character (ASCII 0x0c) use []byte{12})
  4. \S = [^ \t\n\r\f] (the not-a-space class)
  5. \w = [0-9a-zA-Z_] (the letter class)
  6. \W = [^0-9a-zA-Z_] (the not-a-letter class). A short example follows this list.
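
The built-in classes can be used anywhere an explicit class can. A short sketch, assuming the token helper from above and hypothetical INT and NAME token names:

lexer.Add([]byte(`\d+`), token("INT"))  // one or more digits, equivalent to [0-9]+
lexer.Add([]byte(`\w+`), token("NAME")) // one or more of [0-9a-zA-Z_]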

Operators

  1. The pipe operator | indicates alternative choices. For instance the expression a|b matches either the string a or the string b, but not ab, ba, or the empty string.

  2. The parenthesis operator () groups a subexpression together. For instance the expression a(b|c)d matches abd or acd but not abcd.

  3. The star operator * indicates the "starred" subexpression should match zero or more times. For instance, a* matches the empty string, a, aa, aaa and so on.

  4. The plus operator + indicates the "plussed" subexpression should match one or more times. For instance, a+ matches a, aa, aaa and so on.

  5. The maybe operator ? indicates the "questioned" subexpression should match zero or one time. For instance, a? matches the empty string and a. (A combined example follows this list.)
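
Combining the operators with character classes covers most common token shapes. For example, a number with an optional sign and an optional fractional part might be written as follows (a sketch with a hypothetical NUMBER token name):

// -? makes the sign optional, [0-9]+ requires at least one digit,
// and (\.[0-9]+)? makes the whole fractional group optional.
lexer.Add([]byte(`-?[0-9]+(\.[0-9]+)?`), token("NUMBER"))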

Grammar

The canonical grammar is found in the handwritten recursive descent parser. This section should be considered documentation, not a specification.

Note: e stands for the empty string

Regex -> Alternation
Alternation -> AtomicOps Alternation'
Alternation' -> `|` AtomicOps Alternation'
              | e
AtomicOps -> AtomicOp AtomicOps
           | e
AtomicOp -> Atomic
          | Atomic Ops
Ops -> Op Ops
     | e
Op -> `+`
    | `*`
    | `?`
Atomic -> Char
        | Group
Group -> `(` Alternation `)`
Char -> CHAR
      | CharClass
CharClass -> `[` Range `]`
           | `[` `^` Range `]`
Range -> CharClassItem Range'
Range' -> CharClassItem Range'
        | e
CharClassItem -> BYTE
              -> BYTE `-` BYTE
CHAR -> matches any character except '|', '+', '*', '?', '(', ')', '[', ']', '^'
        unless escaped. Additionally '.' is returned as the wildcard character
        which matches any character. Built-in character classes are also handled
        here.
BYTE -> matches any byte

History

This library was started when I was teaching EECS 337 Compiler Design and Implementation at Case Western Reserve University in the Fall of 2014. I wrote two compilers: one was "hidden" from the students as the language implemented was their project language. The other was tcel, which was written initially as an example of how to do type checking. That compiler was later expanded to explain AST interpretation, intermediate code generation, and x86 code generation.

Complete Example

Using the Lexer

package main

import (
    "fmt"
    "log"
)

import (
    "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

func main() {
    s, err := Lexer.Scanner([]byte(`digraph {
  rankdir=LR;
  a [label="a" shape=box];
  c [<label>=<<u>C</u>>];
  b [label="bb"];
  a -> c;
  c -> b;
  d -> c;
  b -> a;
  b -> e;
  e -> f;
}`))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Type    | Lexeme     | Position")
    fmt.Println("--------+------------+------------")
    for tok, err, eof := s.Next(); !eof; tok, err, eof = s.Next() {
        if ui, is := err.(*machines.UnconsumedInput); is {
            // to skip bad token do:
            // s.TC = ui.FailTC
            _ = ui         // ui is only referenced in the comment above
            log.Fatal(err) // however, we will just fail the program
        } else if err != nil {
            log.Fatal(err)
        }
        token := tok.(*lexmachine.Token)
        fmt.Printf("%-7v | %-10v | %v:%v-%v:%v\n",
            Tokens[token.Type],
            string(token.Lexeme),
            token.StartLine,
            token.StartColumn,
            token.EndLine,
            token.EndColumn)
    }
}

Lexer Definition

package main

import (
    "fmt"
    "strings"
)

import (
    "github.com/timtadh/lexmachine"
    "github.com/timtadh/lexmachine/machines"
)

var Literals []string       // The tokens representing literal strings
var Keywords []string       // The keyword tokens
var Tokens []string         // All of the tokens (including literals and keywords)
var TokenIds map[string]int // A map from the token names to their int ids
var Lexer *lexmachine.Lexer // The lexer object. Use this to construct a Scanner

// Called at package initialization. Creates the lexer and populates token lists.
func init() {
    initTokens()
    var err error
    Lexer, err = initLexer()
    if err != nil {
        panic(err)
    }
}

func initTokens() {
    Literals = []string{
        "[",
        "]",
        "{",
        "}",
        "=",
        ",",
        ";",
        ":",
        "->",
        "--",
    }
    Keywords = []string{
        "NODE",
        "EDGE",
        "GRAPH",
        "DIGRAPH",
        "SUBGRAPH",
        "STRICT",
    }
    Tokens = []string{
        "COMMENT",
        "ID",
    }
    Tokens = append(Tokens, Keywords...)
    Tokens = append(Tokens, Literals...)
    TokenIds = make(map[string]int)
    for i, tok := range Tokens {
        TokenIds[tok] = i
    }
}

// Creates the lexer object and compiles the NFA.
func initLexer() (*lexmachine.Lexer, error) {
    lexer := lexmachine.NewLexer()
    for _, lit := range Literals {
        r := "\\" + strings.Join(strings.Split(lit, ""), "\\")
        lexer.Add([]byte(r), token(lit))
    }
    for _, name := range Keywords {
        lexer.Add([]byte(strings.ToLower(name)), token(name))
    }
    lexer.Add([]byte(`//[^\n]*\n?`), token("COMMENT"))
    lexer.Add([]byte(`/\*([^*]|\r|\n|(\*+([^*/]|\r|\n)))*\*+/`), token("COMMENT"))
    lexer.Add([]byte(`([a-z]|[A-Z]|[0-9]|_)+`), token("ID"))
    lexer.Add([]byte(`[0-9]*\.[0-9]+`), token("ID"))
    lexer.Add([]byte(`"([^\\"]|(\\.))*"`),
        func(scan *lexmachine.Scanner, match *machines.Match) (interface{}, error) {
            x, _ := token("ID")(scan, match)
            t := x.(*lexmachine.Token)
            v := t.Value.(string)
            t.Value = v[1 : len(v)-1]
            return t, nil
        })
    lexer.Add([]byte("( |\t|\n|\r)+"), skip)
    lexer.Add([]byte(`\<`),
        func(scan *lexmachine.Scanner, match *machines.Match) (interface{}, error) {
            str := make([]byte, 0, 10)
            str = append(str, match.Bytes...)
            brackets := 1
            match.EndLine = match.StartLine
            match.EndColumn = match.StartColumn
            for tc := scan.TC; tc < len(scan.Text); tc++ {
                str = append(str, scan.Text[tc])
                match.EndColumn += 1
                if scan.Text[tc] == '\n' {
                    match.EndLine += 1
                }
                if scan.Text[tc] == '<' {
                    brackets += 1
                } else if scan.Text[tc] == '>' {
                    brackets -= 1
                }
                if brackets == 0 {
                    match.TC = scan.TC
                    scan.TC = tc + 1
                    match.Bytes = str
                    x, _ := token("ID")(scan, match)
                    t := x.(*lexmachine.Token)
                    v := t.Value.(string)
                    t.Value = v[1 : len(v)-1]
                    return t, nil
                }
            }
            return nil, fmt.Errorf("unclosed HTML literal starting at %d, (%d, %d)",
                match.TC, match.StartLine, match.StartColumn)
        },
    )
    err := lexer.Compile()
    if err != nil {
        return nil, err
    }
    return lexer, nil
}

// a lexmachine.Action function which skips the match.
func skip(*lexmachine.Scanner, *machines.Match) (interface{}, error) {
    return nil, nil
}

// a lexmachine.Action function which constructs a Token of the given token type by
// the token type's name.
func token(name string) lexmachine.Action {
    return func(s *lexmachine.Scanner, m *machines.Match) (interface{}, error) {
        return s.Token(TokenIds[name], string(m.Bytes), m), nil
    }
}
