haxscramper/hparse
Deprecated: writing parsers by hand wins for all of my use cases, so I don't plan to maintain this generic parser generator library.
Collection of various utilities related to parsing, ranging from a statically typed wrapper on top of `scanf` to a full parser generator, automatic tree-sitter wrappers etc.

The pure nim parser generator supports EBNF notation, tree actions, template rules, compile/runtime parsing and automatic parse tree generation.

The tree-sitter wrapper generator is currently WIP, but can already generate user-friendly interfaces for existing grammars.
nimble install hparse
The `hts-wrapgen` command-line utility comes installed with `hparse`, but it needs additional dependencies in order to actually generate grammars. For official installation instructions see tree-sitter's getting started page.
On Arch Linux `tree-sitter-cli` is available and can be installed using `sudo pacman -S tree-sitter`. I couldn't find any conclusive instructions about installing `tree-sitter-cli` on other distributions, so this is the most general approach that 'should work'.
If you are already familiar with `npm`, have it installed etc., you can skip this part. For people like me who have no idea on setting it up:
```
sudo apt-get -qq install -y npm
export NPM_PACKAGES=$HOME/.npm-packages
npm config set prefix $NPM_PACKAGES
export PATH="$PATH:$NPM_PACKAGES/bin"
```
npm install tree-sitter-cli -g
```
cd /tmp
wget https://github.com/tree-sitter/tree-sitter/archive/0.17.3.tar.gz
tar -xvf 0.17.3.tar.gz && cd tree-sitter-0.17.3
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib
```
NOTE: the last line might not be necessary, but when testing the installation in a debian container I ran into issues where the `tree-sitter` library would not be detected without it.
After all necessary dependencies have been installed, you can generate a wrapper for an arbitrary grammar file using `hts-wrapgen grammar <arguments>`. The most commonly used switches are:

- `--grammarJs=<grammar file.js>` to set the input grammar file
- `--langPrefix=<language-name>` to set the language name
You can also use `--parserUser=nimCode.nim` to immediately compile a test nim file. It is not necessary to always use `hts-wrapgen` when compiling code that uses the generated parser, but it has some built-in heuristics to provide better error messages for common errors (mostly related to linking).
`hts-wrapgen` generates a wrapper file for your language, named `<language-name>_wrapper.nim`. The wrapper consists of several `distinct` wrapper types, an enum listing all possible node kinds, and several helper procs. The generated API is made to resemble that of `std/macros` as closely as possible: `len`, indexing via `[]`, the `kind()` proc and so on.
By default `hts-wrapgen` compiles the generated parser source into a static library, which will then be saved as `<language-name>_parser.o` alongside your grammar file.
To build the final application you need to link it with the `tree-sitter` library. To do this from the command line use `--passl:-ltree-sitter`, or use the `{.passl: "-ltree-sitter".}` pragma in your code.
- Input grammar file, usually called `grammar.js`. Note that it is not mandatory to name it like this; it is perfectly fine to have `cpp-grammar.js` for example. See the grammar file example and the tutorial on writing grammars.
To compile a tree-sitter grammar and generate wrappers, run

```
hts-wrapgen grammarFromFile \
  --grammar=your-grammar.js \
  --lang=language
```
Create a temporary directory for your project - during compilation a lot of auxiliary files are created, most of which are not needed later on.
For purposes of demonstration the cpp grammar will be used. You can get all necessary files by either manually downloading `grammar.js` and `scanner.cc` from the tree-sitter-cpp repository on github, or using wget:
```
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/src/scanner.cc
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/grammar.js
```
To compile the cpp grammar, tree-sitter needs to have the `tree-sitter-c` js module installed. It can be obtained using `npm install tree-sitter-c`.
- tree-sitter runtime - the parser generated by tree-sitter is not fully standalone, and your program needs to be linked against the `tree-sitter` library. If the linker fails with ``undefined reference to `ts_parser_new'`` (or a reference to a similar function), most likely you need to pass linker flags via either `{.passl: "-ltree-sitter".}` in your code or `--passl:-ltree-sitter` on the command line.
- external scanners - external scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules that cannot be described by regular expressions. These functions are written in a separate `scanner.c` file (example for the C++ parser) that has to be compiled and then linked with the final application. If you have errors like ``undefined reference to `tree_sitter_cpp_external_scanner_destroy'`` (note the word 'external'), this is an indication that you must also link the compiled scanner code, for example using `{.passl: "cppscanner.o".}` (note: the name of the linked object file will be different depending on your language name).
- c++ stdlib - some scanner implementations might be written in C++ and therefore depend on the C++ runtime to operate. If you get compilation errors like ``undefined reference to `std::__cxx11::basic_`` then you need to pass `{.passl: "-lstdc++".}`.
- parser runtime - all other linking errors are most likely related to missing linking with the compiled parser, and can be solved by adding `{.passl: "cppparser.o".}` (note: the name of the linked object file will be different depending on your language name).
Some of the tree-sitter grammars can be tried out online in the interactive playground.
The wrapper generates a heterogeneous AST with a defined `[]` operator, `len` and `kind` procs, which means it can be used with pattern matching. Until I finish the implementation you can import it from `hmisc/macros/matching` and use it like
```nim
case tree[0]:
  of Declaration[@dtype, .._]:
    echo "first is declaration with type ", dtype.strVal()
```
It requires enabling `{.experimental: "caseStmtMacros".}`.
- Generate tree-sitter grammar files using EBNF notation from the pure nim parser generator. The grammar is already available at runtime as a value, so it is a simple matter of converting it into javascript code.
- Implement custom scanners in nim. You already compile nim code to C, so why spend 10x the effort writing things in C for scanners when you can just do the same in nim. It is already possible to do this, but the process could be streamlined even more.
Statically typed wrapper on top of `scanf`; supports all matcher syntax (e.g. `$w`, `$i` etc.) as well as custom matcher procedures. A tuple `ml` is injected in the scope, with the following types for each capture:

- `i, o, b, h`: int
- `f`: float
- `*, +`: string
- `${matcherProc}`: if the matcher proc has signature `(s: string, arg: var T, start: int): int` then the tuple field will be of type `T`
```nim
import hparse/tscanf

func matcher1(s: string, arg: var seq[string], start: int): int =
  #                      ^^^^^^^^^^^^^^^^^^^^
  #                      type of the captured variable
  arg = @["##", "$$"]
  return s.len - start

if tscanf("12x12---%1,1,1,1", "$ix$i$+%${matcher1}"):
  echo ml[0] / 2, " = ", ml[2], " ", ml[3]
  assert declared(ml)
  assert type(ml[3]) is seq[string]
  #                     ^^^^^^^^^^^
  #                     Resulting field type in tuple
else:
  assert not declared(ml)
  echo "does not match"
```

```
6.0 = --- @["##", "$$"]
```
Simple addition to the `parseutils` library from the stdlib - separates a string into tokens based on character sets.
```nim
import hparse/tokenize
import hpprint

type
  LispPart = enum
    lpPunct
    lpQuoted
    lpIdent
    lpIntLit

pprint "(hello '(world) 22)".tokenize({
  {'(', ')'} : lpPunct,
  {'0' .. '9'} : lpIntLit,
  {'\'', 'a' .. 'z', 'A' .. 'Z', ')', '('} : lpQuoted,
  {'a' .. 'z', 'A' .. 'Z'} : lpIdent
})
```

```
- (lpPunct, "(")
- (lpQuoted, "hello")
- (lpQuoted, "'(world)")
- (lpIntLit, "22")
- (lpPunct, ")")
```
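To make the idea concrete, here is an illustrative Python sketch of charset-based tokenization (first matching set wins, then maximal munch within that set). This is a conceptual model only, not the actual `hparse/tokenize` implementation, and the exact matching rules of the real proc may differ:

```python
from string import ascii_letters

def tokenize(s, table):
    # table: list of (charset, kind) pairs; the first set containing the
    # current character wins, then we consume as long as chars stay in it.
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        for charset, kind in table:
            if s[i] in charset:
                j = i
                while j < len(s) and s[j] in charset:
                    j += 1
                tokens.append((kind, s[i:j]))
                i = j
                break
        else:
            i += 1  # skip characters not covered by any set

    return tokens

# Hypothetical table mirroring the lisp example above (names reused for flavor)
table = [
    (set("()"), "lpPunct"),
    (set("0123456789"), "lpIntLit"),
    (set(ascii_letters), "lpIdent"),
]
toks = tokenize("(hello 22)", table)
```

Here `toks` comes out as `[("lpPunct", "("), ("lpIdent", "hello"), ("lpIntLit", "22"), ("lpPunct", ")")]`: the table order decides which set claims a character shared by several sets.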
Lisp notation for regex description. Reimplementation of the emacs-lisp `rx` macro.
```
(print (rx (and "a" "E") (or "()" "{}")))
```

```
aE\(?:()\|{}\)
```
The parser generator focuses on simplicity and ease of use. Concrete implementation details of particular parser algorithms are hidden as much as possible - you write a grammar, provide input tokens, and get a tree. The whole API can be described as
```nim
let grammar = makeGrammar:
  # grammar definition

let parser = new<Algorithm-name>Parser(grammar)
var stream = # create token stream
let tree = parser.parse(stream)
```
The grammar is described using EBNF notation, the only exception being the use of prefix notation - e.g. for zero-or-more you need to write `*E`.
Example of a very simple grammar:

```nim
const defaultCategory = catNoCategory

let grammar = initGrammar[NoCategory, string]:
  A ::= *("hello" & "world")
```
More complex example (with resulting tree):

```nim
exampleGrammarConst(grammar):
  List ::= !"[" & Elements & !"]"
  Elements ::= Element & @*(@(!"," & Element))
  Element ::= "i" | List

let parser = exampleParser(grammar)
var stream = "[i,i,[i,i,i],i]".mapIt($it).makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
```
```
+-> List
    +-> Elements
        +-> Element +-> 'i'
        +-> Element +-> 'i'
        +-> Element
        |   +-> List
        |       +-> Elements
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        +-> Element +-> 'i'
```
Note: `<string>` means a string literal.
- `*` zero-or-more
- `+` one-or-more
- `?` optional
- `&` concatenation
- `|` alternative
- `Nonterminal ::= ...` declare a new nonterminal. The identifier must be uppercased.
- `<string>` token literal. The default category is used.
- `<string>.prCat` or `<string>.cat` token literal with lexeme `<string>` and category `prCat`. The prefix is automatically inferred on grammar construction and can be omitted.
- `[[ expr ]]` token with lexeme predicate.
- `[ ... ]` option
- `!` drop
- `@` splice-discard
- `^` promote
- `^@` splice-promote
- `!*` drop zero-or-more elements
- `!+` drop one-or-more
- `@+` splice one-or-more
- `@*` splice zero-or-more
- `!?` drop optional
- `^?` promote optional
Invalid combinations: `*!`, `+!`, `*@`, `+@`, `*^@`, `+^@`, `+^`, `*^`
The result of the parser generator is a parse tree - a direct representation of the original source code that contains all helper symbols (punctuation, brackets, precedence levels etc.). All of this cruft is necessary to correctly recognize the input sequence of tokens, but is completely irrelevant afterwards - in the nested list grammar only the `Elements` are actually necessary, everything else can be thrown away immediately. Tree actions are intended for this exact purpose - dropping unnecessary parts of the parse tree, flattening out nested parts etc. Right now there are five types of tree actions (four implemented).
Completely remove subtree element
```nim
echo ecompare(@["a", "b", "c"]) do:
  A ::= "a" & "b" & "c"
do:
  A ::= "a" & !"b" & "c"
```

```
+-> A        +-> A
    +-> 'a'      +-> 'a'
    +-> 'b'      +-> 'c'
    +-> 'c'
```
Add subnode elements in parent tree. Subtree head is removed.
```nim
echo ecompare(@["-", "+", "+", "+", "-"]) do:
  A ::= "-" & *"+" & "-"
do:
  A ::= "-" & @*"+" & "-"
```

```
+-> A                  +-> A
    +-> '-'                +-> '-'
    +-> [ [ ... ] ]        +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'
    +-> '-'                +-> '-'
```
Splice all node elements and replace the parent node. NOTE: this replaces only the parent node - in an expression like `E ::= A & B` the parent node for `B` is the concatenation, not the nonterminal head.
```nim
echo ecompare(@["-", "+", "+", "+"]) do:
  A ::= "-" & B
  B ::= *"+"
do:
  A ::= "-" & ^@B
  B ::= *"+"
```

```
+-> A            +-> A
    +-> '-'          +-> B
    +-> B                +-> '-'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
```
Move part of the tree into separate list
```nim
echo ecompare(@["-", "z", "e"]) do:
  A ::= "-" & "z" & "e"
do:
  A ::= "-" & { "z" & "e" }
```

```
+-> A            +-> A
    +-> '-'          +-> '-'
    +-> 'z'          +-> [ [ ... ] ]
    +-> 'e'              +-> 'z'
                         +-> 'e'
```
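As a rough model of what drop (`!`) and splice-discard (`@`) do to a tree, here is a Python sketch on tuple-based trees. It is purely illustrative: the affected heads are named explicitly, whereas the real library attaches actions to grammar positions:

```python
# Toy parse tree: (head, child1, child2, ...); leaves are plain strings.
# drop  (!) : remove the element entirely.
# splice (@): add the node's children to the parent, removing the head.
def apply_actions(tree, drop=frozenset(), splice=frozenset()):
    head, out = tree[0], []
    for kid in tree[1:]:
        if isinstance(kid, tuple):
            kid = apply_actions(kid, drop, splice)
            if kid[0] in drop:
                continue
            if kid[0] in splice:
                out.extend(kid[1:])
                continue
            out.append(kid)
        elif kid in drop:
            continue
        else:
            out.append(kid)
    return (head, *out)

# Mirrors the `!"b"` example above: 'b' disappears from A's subnodes.
assert apply_actions(("A", "a", "b", "c"), drop={"b"}) == ("A", "a", "c")

# Mirrors `@*`: the intermediate list node is flattened into A.
nested = ("A", "-", ("LIST", "+", "+", "+"), "-")
assert apply_actions(nested, splice={"LIST"}) == ("A", "-", "+", "+", "+", "-")
```

The point is that both actions are purely structural rewrites applied bottom-up after recognition, which is why they cannot change what the grammar accepts, only the shape of the resulting tree.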
Some patterns occur often in grammar construction - lists with delimiters, kv pairs etc. Even though the grammar is pretty simple, writing something like `Element & @*(@(!"," & Element))` over and over again is not really fun. Parse templates are designed to solve this issue.

A parse template is a function that will be executed to produce part of the pattern. In this example we generate a template rule for a comma-separated list of strings.
```nim
proc csvList(str: string): Patt[NoCategory, string] =
  andP(
    makeExpNoCat(str).tok(),
    zeroP(
      andP(
        makeExpNoCat(",").tok().addAction(taDrop),
        makeExpNoCat(str).tok()
      ).addAction(taSpliceDiscard)
    ).addAction(taSpliceDiscard)
  )

echo csvList("@").exprRepr()
echo eparse(@["@", ",", "@"], A ::= %csvList("@"))
```

```
{'@' & @*(@{!',' & '@'})}
+-> A
    +-> '@'
    +-> '@'
```
The DSL syntax is `%functionName(..<list-of-arguments>..)`. For codegen-based parsers (recursive `LL(1)` and `LL(*)`) the function MUST be executable at compile time. In all other cases grammar construction happens at runtime. In the example above an `LL(*)` parser was used.
A token has three generic parameters, referred to as `C`, `L` and `I` throughout the codebase.

- The first one is the 'category' of the token. It is expected (but not mandatory) to be an enum. Categories are usually things like punctuation, identifier, string/int literal, etc. If you don't need token categories, use the `NoCategory` enum.
- The second parameter is the 'lexeme'. It can be absolutely anything (`void` included). This field stores 'all other' information about the token - integer/string value for literals, for example.
- The last parameter is 'information'. Similar to the lexeme, but made for storing additional 'metainformation' about the token: position in source code, order in the original token stream etc. This information is NOT used in parsing.
For an example of custom token category/lexeme see manual.org.
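For illustration only, the three parameters map onto a simple generic record. This Python analogue (field names and the `Cat` enum are invented, not the library's type) shows how the pieces divide responsibilities:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Generic, TypeVar

C = TypeVar("C")  # category
L = TypeVar("L")  # lexeme
I = TypeVar("I")  # (meta)information

class Cat(Enum):
    # hypothetical category enum for the example
    IDENT = auto()
    INTLIT = auto()

@dataclass
class Token(Generic[C, L, I]):
    cat: C    # coarse token class; this is what drives parsing decisions
    lex: L    # payload: string/int value of the token etc.
    info: I   # position in source, stream order etc.; NOT used in parsing

tok = Token(Cat.IDENT, "hello", (1, 0))  # token at line 1, column 0
```

The split matters because the parser only ever dispatches on `cat` (and optionally a predicate over `lex`), so `info` can carry arbitrary bookkeeping without affecting recognition.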
A token is accepted if the lexeme predicate evaluates to 'true'. The predicate is placed in double square braces: `[[ expr ]]`. Depending on the syntax of the expression, different actions are performed.

- If it is an `Infix`, `Call` or `DotExpr` (e.g. `it in ["a", "B"]`, `startsWith(it, "..")`), the whole expression is wrapped into a predicate function `proc(it: L): bool {.noSideEffect.} = <your-expression>`.
- Otherwise it is passed to `makeExpTokenPredUsr(cat: C, val: <your-expression-type>)`.
```nim
import strutils, strformat

const defaultCategory = catNoCategory

func makeExpTokenPredUsr(
    cat: NoCategory, valset: bool
  ): ExpectedToken[NoCategory, string] =

  result = makeExpTokenPred[NoCategory, string](
    catNoCategory,  # Expected token category
    &"[{valset}]",  # string representation of expected token predicate
                    # (for pretty-printing)
    proc(str: string): bool = valset  # Construct predicate yourself
  )

initGrammarConst[NoCategory, string](grammar):
  A ::= *(B | C)
  B ::= [[ it.startsWith("@") ]]
  #        ^^^^^^^^^^^^^^^^^^
  #        |
  #        Copied to predicate directly
  C ::= [[ true ]] # Fallback nonterminal
  #        ^^^^
  #        |
  #        Passed to `makeExpTokenPredUsr`

let parser = newLLStarParser[NoCategory, string, void](grammar)
var stream = @["@ident", "#comment", "@ident"].makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
```
A large part of the design is described in devnotes.org; all functions and types are documented in the source code. If you have any additional questions feel free to join my discord server and ask them there.

I'm not an expert on parsing algorithms and related things, so I tried to design it in a way that would actually abstract things and make the API easy to understand.
Not supporting syntactic predicates allows the use of multiple parsing algorithms for the same grammar, ranging from the restrictive but fast `LL(1)` to something like an Earley parser.
The parser abstracts the notion of a token and is not tied to any lexer implementation - if you want, you can just split a string on spaces and call it a lexer. Or you can do some heuristics in the lexer and assign categories based on context. Or something else, I don't know.
The whole grammar is available as a value, which means it is possible to easily do all sorts of preprocessing and error detection (like use of undeclared nonterminals, left recursion detection and so on).
Tree actions and template rules provide a small, but hopefully useful, subset of syntactic actions. The advantage: it is possible to know exactly how the tree will look. Generating a statically typed case object for a grammar is possible.

The parser generator was originally intended to work in conjunction with a term rewriting system. You write the grammar in EBNF notation, dropping all cruft immediately (using splice-discard and drop rules), and then declaratively transform the tree into something else.
The parser generator is currently work-in-progress. All advertised features are implemented, but the number of supported algorithms is lacking - only backtracking `LL(*)` is fully supported. Codegen and table-driven `LL(1)` are partially supported (they have some weird bugs). Some work has been done on adding `SLR` and `Earley` parsers.
The parser generator has a relatively clean and documented internal API, designed to make implementation of new algorithms as simple as possible (most of the details are abstracted).
All sorts of contributions are welcome - issues, unit tests, documentation updates etc.
In addition there are several things that I wasn't able to implement myself. If you are interested in solving one of these problems it would be especially useful. If you have any questions about implementation details, the API etc., you can join my discord server.
WARNING: another issue I ran into - when actually running the compiled & linked library in a container, it just dies with the following, and I have no idea how to fix it.

```
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:440
440     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
```
Only recursive descent parsers can accept EBNF notation as-is. Every other one requires conversion from EBNF to BNF (implemented, tested). The problem is that this transformation changes the shape of the parse tree. For example `A ::= *(E)` is converted to `A ::= E1` and `E1 ::= Ɛ | E E1` - iteration is replaced with recursion.
```nim
const defaultCategory = catNoCategory

initGrammarConst[NoCategory, string](grammar):
  A ::= "hello" & *(B) & "world"
  B ::= "!!"

var toks = @["hello", "!!", "!!", "!!", "world"].makeTokens().makeStream()

let grammarVal = block:
  let tmp = grammar
  tmp.toGrammar()

echo "Original grammar"
echo grammarVal.exprRepr()
echo "---\n"

echo "Grammar converter to BNF"
echo grammarVal.toBNF().exprRepr()
echo "---\n"

echo "Recursive descent tree"
let parser1 = newLLStarParser[NoCategory, string, void](grammar)
let tree1 = parser1.parse(toks)
echo tree1.treeRepr()
echo "---\n"

toks.revertTo(0)

echo "Table-driven parser tree without structure fixup"
let parser2 = newLL1TableParser(
  grammarVal, dofixup = false, retainGenerated = true)
let tree2 = parser2.parse(toks)
echo tree2.treeRepr()
echo "---\n"

toks.revertTo(0)

echo "Table-driven parser tree with fixup"
let parser3 = newLL1TableParser(grammarVal, dofixup = true)
let tree3 = parser3.parse(toks)
echo tree3.treeRepr()
echo "---\n"
```
```
Original grammar
A ::= {'hello' & *(<B>) & 'world'}
B ::= '!!'
---

Grammar converter to BNF
A ::= .0 | 'hello' & <A0_1> & 'world'
B ::= .0 | '!!'
A0_1 ::= .0 | Ɛ
         .1 | <B> & <@A0_1>
---

Recursive descent tree
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree without structure fixup
+-> A
    +-> 'hello'
    +-> A0_1
    |   +-> B +-> '!!'
    |   +-> A0_1
    |       +-> B +-> '!!'
    |       +-> A0_1
    |           +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree with fixup
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---
```
Instead of `*(B)` a new rule `A0_1` is introduced, with two possible alternatives: either an empty production (`Ɛ`), or `B` followed by `A0_1` again. How this conversion affects the parse tree can be seen in the output: instead of a simple list of elements you get a deeply nested tree of `A0_1`. This is fixed automatically when converting the `EBNF` grammar to `BNF` by adding a 'splice' rule on every use of a newly generated pattern.
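The iteration-to-recursion rewrite and the splice fixup can be sketched in a few lines. This is an illustrative Python model of the idea (function names invented), not the library's converter: the recursive `A0_1`-style rule produces a right-nested tree, and splicing flattens it back into the list the EBNF form would have given directly:

```python
# BNF form of `A ::= *(B)`:  A0_1 ::= Ɛ | B A0_1
# Recognizing with the recursive rule nests one A0_1 node per element.
def parse_star_recursive(tokens):
    if not tokens:
        return ("A0_1",)  # Ɛ alternative: empty node
    return ("A0_1", tokens[0], parse_star_recursive(tokens[1:]))

# Splice fixup: collapse the nested A0_1 chain back into a flat list,
# recovering the shape the original *(B) tree would have had.
def splice(tree):
    if len(tree) == 1:
        return []  # Ɛ node contributes nothing
    return [tree[1]] + splice(tree[2])

t = parse_star_recursive(["!!", "!!", "!!"])
# t is right-nested, one A0_1 per '!!' plus a trailing Ɛ node
flat = splice(t)
```

Here `flat` is `["!!", "!!", "!!"]`, matching the flat `[ [ ... ] ]` list in the fixed-up table-driven output above.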
It kind of works (not really tested though), but I'm yet to figure out how to preserve the original tree actions. For example, when converting something like `@*(@{!',' & <Element>})` to BNF it gets flattened out, and it is not clear how to first splice things in `!',' & <Element>`, and then splice them again.
- [ ] support `<token-literal>` in grammar
- [ ] generate errors on unknown nonterminals used in a production
- [ ] unit tests for nimscript and js
- [ ] error reporting. Right now it is basically non-existent
Right now the parse tree is 'stringly typed' - nonterminal heads are described using `string` and all subnodes are placed in the same `subnodes: seq[ParseTree[...]]`.

The grammar DSL contains all necessary information to construct a case object with a selector enum, as well as to order all fields (the `LL(*)` parser uses the constant grammar to generate a set of mutually recursive functions). Tree actions could provide almost all necessary information for field types and ordering.
Possible mapping from grammar to constructed object:

- `Nterm ::= ...` -> `of ptrNterm: <fields>`
- `E1 & E2 & E3` -> `tuple[e1: <type-of-E1>, ... ]`
- `*E1` and `+E1` -> `seq[<type-of-E1>]`
- `?E1` -> `Optional[<type-of-E1>]`
- `E1 | E2` -> `case idx: [<number-of-alternatives>]`, and each alternative gets its own field. Case objects can be nested, so this is not a problem.
- `<token>` -> `tok: Token[...]`
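Applied to the nested-list grammar from earlier, the mapping could produce something like this hypothetical typed AST. The Python analogue below is for illustration only (all names invented; a `Union` stands in for the nested case object of the alternatives):

```python
from dataclasses import dataclass
from typing import Union

# Grammar being mapped (after tree actions strip brackets and commas):
#   List ::= !"[" & Elements & !"]"
#   Elements ::= Element & @*(@(!"," & Element))
#   Element ::= "i" | List

@dataclass
class NestedList:
    elements: list["Element"]   # *Element -> seq[<type-of-Element>]

@dataclass
class Element:
    alt: Union[str, NestedList]  # E1 | E2 -> one branch per alternative
                                 # (collapsed into a Union here for brevity)

# "[i,[i]]" would map to:
tree = NestedList([Element("i"), Element(NestedList([Element("i")]))])
```

The benefit over the stringly-typed tree is that shape errors (e.g. indexing a leaf) become type errors instead of runtime surprises, which is exactly what the case-object proposal above is after.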
There are several questions related to possible use cases, ease of use etc.

- [ ] Deterministic and intuitive names for fields.
- [ ] How should fields be named? It is not possible to have same-named fields in nim case objects.
Right now `ParseTree[C, L, I]` is hardcoded into all parsers - I don't think it will be enough for all use cases.

- It might be required to have a separate parse tree type defined for each grammar.
- Integration with `nimtrs` - construct a term instead of a parse tree, and maybe run rewriting actions immediately.
I'm like, 40% sure that I'm not sure about what it is, but it looked nice when I saw it last time. It is related to prolog, and nimtrs already implements large portions (no clauses and backtracking, but full support for unification and all auxiliary functions for working with terms and environments).
The DSL for this library uses hmisc/hexceptions to generate much better compilation errors in case of a malformed DSL.
```nim
let tree = "h".exampleParse:
  A ::= !@*("h")
echo tree.treeRepr()
```

```
Unexpected prefix: '!@*'

 2    let tree = "h".exampleParse:
 5:8      A ::= !@*("h")
                ^^^
                |
                Incorrect prefix combination

Raised in grammar_dsl.nim:112 [CodeError:ObjectType]
```
NOTE: the output is not colored in the readme (because github fails to support this basic feature since 2014), but it is colored by default in the terminal (controlled by the `-d:plainStdout` compilation flag).