haxscramper/hparse
Deprecated: writing parsers by hand wins for all of my use cases, so I don't plan to maintain this generic parser generator library.
Collection of various utilities related to parsing, ranging from a statically typed wrapper on top of `scanf` to a full parser generator, automatic tree-sitter wrappers etc.

The pure nim parser generator supports EBNF notation, tree actions, template rules, compile/runtime parsing and automatic parse tree generation.

The tree-sitter wrapper generator is currently WIP, but can already generate user-friendly interfaces for existing grammars.
nimble install hparse
The `hts-wrapgen` command-line utility comes installed with `hparse`, but it needs additional dependencies in order to actually generate grammars. For official installation instructions see tree-sitter's getting started page.
On Arch Linux `tree-sitter-cli` is available and can be installed using `sudo pacman -S tree-sitter`. I couldn't find any conclusive instructions about installing `tree-sitter-cli` on other distributions, so this is the most general approach that 'should work'.
If you are already familiar with `npm`, have it installed etc., you can skip this part. For people like me who have no idea on setting it up:
```
sudo apt-get -qq install -y npm
export NPM_PACKAGES=$HOME/.npm-packages
npm config set prefix $NPM_PACKAGES
export PATH="$PATH:$NPM_PACKAGES/bin"
```
npm install tree-sitter-cli -g
```
cd /tmp
wget https://github.com/tree-sitter/tree-sitter/archive/0.17.3.tar.gz
tar -xvf 0.17.3.tar.gz && cd tree-sitter-0.17.3
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib
```
NOTE: the last line might not be necessary, but when testing the installation in a debian container I ran into issues where the `tree-sitter` library would not be detected without it.
After all necessary dependencies have been installed, you can generate a wrapper for an arbitrary grammar file using `hts-wrapgen grammar <arguments>`. The most commonly used switches are:

- `--grammarJs=<grammar file.js>` to set the input grammar file
- `--langPrefix=<language-name>` to set the language name
You can also use `--parserUser=nimCode.nim` to immediately compile a test nim file. It is not necessary to always use `hts-wrapgen` when compiling code that uses the generated parser, but it has some built-in heuristics to provide better error messages for common errors (mostly related to linking).
`hts-wrapgen` generates a wrapper file for your language, named `<language-name>_wrapper.nim`. The wrapper consists of several `distinct` wrapper types, an enum listing all possible node kinds, and several helper procs. The generated API is made to resemble that of `std/macros` as closely as possible: `len`, indexing via `[]`, the `kind()` proc and so on.
By default `hts-wrapgen` compiles the generated parser source into a static library, which will then be saved as `<language-name>_parser.o` alongside your grammar file.
To build the final application you need to link it with the `tree-sitter` library. To do this from the command line use `--passl:-ltree-sitter`, or use the `{.passl: "-ltree-sitter".}` pragma in your code.
- Input grammar file, usually called `grammar.js`. Note that it is not mandatory to name it like this; it is perfectly fine to have `cpp-grammar.js` for example. See the grammar file example and the tutorial on writing grammars.
To compile a tree-sitter grammar and generate wrappers, run

```
hts-wrapgen grammarFromFile \
  --grammar=your-grammar.js \
  --lang=language
```
Create a temporary directory for your project - during compilation a lot of auxiliary files are created, most of which are not needed later on.
For purposes of demonstration the cpp grammar will be used. You can get all necessary files by either manually downloading `grammar.js` and `scanner.cc` from the tree-sitter-cpp repository on github, or using wget:
```
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/src/scanner.cc
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/grammar.js
```
To compile the cpp grammar, tree-sitter needs to have the `tree-sitter-c` js module installed. It can be obtained using `npm install tree-sitter-c`.
- tree-sitter runtime - the parser generated by tree-sitter is not fully standalone, and your program needs to be linked against the `tree-sitter` library. If the linker fails with ``undefined reference to `ts_parser_new'`` (or a reference to a similar function), most likely you need to pass linker flags via either `{.passl: "-ltree-sitter".}` in your code or `--passl:-ltree-sitter` on the command line.
- external scanners - external scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules that cannot be described by regular expressions. These functions are written in a separate `scanner.c` file (example for the C++ parser) that has to be compiled and then linked with the final application. If you have errors like ``undefined reference to `tree_sitter_cpp_external_scanner_destroy'`` (note the word 'external'), this is an indication that you must also link the compiled scanner code, for example using `{.passl: "cppscanner.o".}` (note: the name of the linked object file will be different depending on your language name).
- c++ stdlib - some scanner implementations might be written in C++ and therefore depend on the C++ runtime to operate. If you get compilation errors like ``undefined reference to `std::__cxx11::basic_`` then you need to pass `{.passl: "-lstdc++".}`.
- parser runtime - all other linking errors are most likely related to missing linking with the compiled parser, and can be solved by adding `{.passl: "cppparser.o".}` (note: the name of the linked object file will be different depending on your language name).
Some of the tree-sitter grammars can be tried out online in the interactive playground.
The wrapper generates a heterogeneous AST with a defined `[]` operator, `len` and `kind` procs, which means it can be used with pattern matching. Until I finish the implementation you can import it from `hmisc/macros/matching` and use it like
```nim
case tree[0]:
  of Declaration[@dtype, .._]:
    echo "first is declaration with type ", dtype.strVal()
```
It requires enabling `{.experimental: "caseStmtMacros".}`.
- Generate tree-sitter grammar files using EBNF notation from the pure nim parser generator. The grammar is already available at runtime as a value, so it is a simple matter of converting it into javascript code.
- Implement custom scanners in nim. You already compile nim code to C, so why spend 10x the effort writing things in C for scanners when you can just do the same in nim. It is already possible to do this, but the process could be streamlined even more.
Statically typed wrapper on top of `scanf`; supports all matcher syntax (e.g. `$w`, `$i` etc.) as well as custom matcher procedures. A tuple `ml` is injected in the scope, with the following types for each capture:

- `i, o, b, h`: int
- `f`: float
- `*, +`: string
- `${matcherProc}`: if the matcher proc has signature `(s: string, arg: var T, start: int): int` then the tuple field will be of type `T`
```nim
import hparse/tscanf

func matcher1(s: string, arg: var seq[string], start: int): int =
  #                      ^^^^^^^^^^^^^^^^^^^^
  #                      type of the captured variable
  arg = @["##", "$$"]
  return s.len - start

if tscanf("12x12---%1,1,1,1", "$ix$i$+%${matcher1}"):
  echo ml[0] / 2, " = ", ml[2], " ", ml[3]
  assert declared(ml)
  assert type(ml[3]) is seq[string]
  #                     ^^^^^^^^^^^
  #                     Resulting field type in tuple
else:
  assert not declared(ml)
  echo "does not match"
```

```
6.0 = --- @["##", "$$"]
```
Simple addition to the `parseutils` library from the stdlib - separates a string into tokens based on character sets.
```nim
import hparse/tokenize
import hpprint

type
  LispPart = enum
    lpPunct
    lpQuoted
    lpIdent
    lpIntLit

pprint "(hello '(world) 22)".tokenize({
  {'(', ')'} : lpPunct,
  {'0' .. '9'} : lpIntLit,
  {'\'', 'a' .. 'z', 'A' .. 'Z', ')', '('} : lpQuoted,
  {'a' .. 'z', 'A' .. 'Z'} : lpIdent
})
```

```
- (lpPunct, "(")
- (lpQuoted, "hello")
- (lpQuoted, "'(world)")
- (lpIntLit, "22")
- (lpPunct, ")")
```
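To make the idea concrete, here is an illustrative Python sketch of charset-based tokenization (first matching set wins, then maximal munch within that set). This is a conceptual model only, not the actual `hparse/tokenize` implementation, and the exact matching rules of the real proc may differ:

```python
from string import ascii_letters

def tokenize(s, table):
    # table: list of (charset, kind) pairs; the first set containing the
    # current character wins, then we consume as long as chars stay in it.
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        for charset, kind in table:
            if s[i] in charset:
                j = i
                while j < len(s) and s[j] in charset:
                    j += 1
                tokens.append((kind, s[i:j]))
                i = j
                break
        else:
            i += 1  # skip characters not covered by any set

    return tokens

# Hypothetical table mirroring the lisp example above (names reused for flavor)
table = [
    (set("()"), "lpPunct"),
    (set("0123456789"), "lpIntLit"),
    (set(ascii_letters), "lpIdent"),
]
toks = tokenize("(hello 22)", table)
```

Here `toks` comes out as `[("lpPunct", "("), ("lpIdent", "hello"), ("lpIntLit", "22"), ("lpPunct", ")")]`: the table order decides which set claims a character shared by several sets.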
Lisp notation for regex description. Reimplementation of the emacs-lisp `rx` macro.
```
(print (rx (and "a" "E") (or "()" "{}")))
```

```
aE\(?:()\|{}\)
```
The parser generator focuses on simplicity and ease of use. Concrete implementation details of particular parser algorithms are hidden as much as possible - you write a grammar, provide input tokens, and get a tree. The whole API can be described as
```nim
let grammar = makeGrammar:
  # grammar definition

let parser = new<Algorithm-name>Parser(grammar)
var stream = # create token stream
let tree = parser.parse(stream)
```
The grammar is described using EBNF notation, the only exception being the use of prefix notation - e.g. for zero-or-more you need to write `*E`.
Example of a very simple grammar:

```nim
const defaultCategory = catNoCategory

let grammar = initGrammar[NoCategory, string]:
  A ::= *("hello" & "world")
```
More complex example (with resulting tree):

```nim
exampleGrammarConst(grammar):
  List ::= !"[" & Elements & !"]"
  Elements ::= Element & @*(@(!"," & Element))
  Element ::= "i" | List

let parser = exampleParser(grammar)
var stream = "[i,i,[i,i,i],i]".mapIt($it).makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
```
```
+-> List
    +-> Elements
        +-> Element +-> 'i'
        +-> Element +-> 'i'
        +-> Element
        |   +-> List
        |       +-> Elements
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        +-> Element +-> 'i'
```
Note: `<string>` means a string literal.
- `*` zero-or-more
- `+` one-or-more
- `?` optional
- `&` concatenation
- `|` alternative
- `Nonterminal ::= ...` declare a new nonterminal. The identifier must be uppercased.
- `<string>` token literal. The default category is used.
- `<string>.prCat` or `<string>.cat` token literal with lexeme `<string>` and category `prCat`. The prefix is automatically inferred on grammar construction and can be omitted.
- `[[ expr ]]` token with lexeme predicate.
- `[ ... ]` option
- `!` drop
- `@` splice-discard
- `^` promote
- `^@` splice-promote
- `!*` drop zero-or-more elements
- `!+` drop one-or-more
- `@+` splice one-or-more
- `@*` splice zero-or-more
- `!?` drop optional
- `^?` promote optional
Invalid combinations: `*!`, `+!`, `*@`, `+@`, `*^@`, `+^@`, `+^`, `*^`
The result of the parser generator is a parse tree - a direct representation of the original source code that contains all helper symbols (punctuation, brackets, precedence levels etc.). All of this cruft is necessary to correctly recognize the input sequence of tokens, but is completely irrelevant afterwards - in the nested list grammar only the `Elements` are actually necessary, everything else can be thrown away immediately. Tree actions are intended for this exact purpose - dropping unnecessary parts of the parse tree, flattening out nested parts etc. Right now there are five types of tree actions (four implemented).
Completely remove subtree element
```nim
echo ecompare(@["a", "b", "c"]) do:
  A ::= "a" & "b" & "c"
do:
  A ::= "a" & !"b" & "c"
```

```
+-> A        +-> A
    +-> 'a'      +-> 'a'
    +-> 'b'      +-> 'c'
    +-> 'c'
```
Add subnode elements in parent tree. Subtree head is removed.
```nim
echo ecompare(@["-", "+", "+", "+", "-"]) do:
  A ::= "-" & *"+" & "-"
do:
  A ::= "-" & @*"+" & "-"
```

```
+-> A                  +-> A
    +-> '-'                +-> '-'
    +-> [ [ ... ] ]        +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'            +-> '+'
    |   +-> '+'
    +-> '-'                +-> '-'
```
Splice all node elements and replace the parent node. NOTE: this replaces only the parent node - in an expression like `E ::= A & B` the parent node for `B` is the concatenation, not the nonterminal head.
```nim
echo ecompare(@["-", "+", "+", "+"]) do:
  A ::= "-" & B
  B ::= *"+"
do:
  A ::= "-" & ^@B
  B ::= *"+"
```

```
+-> A            +-> A
    +-> '-'          +-> B
    +-> B                +-> '-'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
```
Move part of the tree into separate list
```nim
echo ecompare(@["-", "z", "e"]) do:
  A ::= "-" & "z" & "e"
do:
  A ::= "-" & { "z" & "e" }
```

```
+-> A            +-> A
    +-> '-'          +-> '-'
    +-> 'z'          +-> [ [ ... ] ]
    +-> 'e'              +-> 'z'
                         +-> 'e'
```
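As a rough model of what drop (`!`) and splice-discard (`@`) do to a tree, here is a Python sketch on tuple-based trees. It is purely illustrative: the affected heads are named explicitly, whereas the real library attaches actions to grammar positions:

```python
# Toy parse tree: (head, child1, child2, ...); leaves are plain strings.
# drop  (!) : remove the element entirely.
# splice (@): add the node's children to the parent, removing the head.
def apply_actions(tree, drop=frozenset(), splice=frozenset()):
    head, out = tree[0], []
    for kid in tree[1:]:
        if isinstance(kid, tuple):
            kid = apply_actions(kid, drop, splice)
            if kid[0] in drop:
                continue
            if kid[0] in splice:
                out.extend(kid[1:])
                continue
            out.append(kid)
        elif kid in drop:
            continue
        else:
            out.append(kid)
    return (head, *out)

# Mirrors the `!"b"` example above: 'b' disappears from A's subnodes.
assert apply_actions(("A", "a", "b", "c"), drop={"b"}) == ("A", "a", "c")

# Mirrors `@*`: the intermediate list node is flattened into A.
nested = ("A", "-", ("LIST", "+", "+", "+"), "-")
assert apply_actions(nested, splice={"LIST"}) == ("A", "-", "+", "+", "+", "-")
```

The point is that both actions are purely structural rewrites applied bottom-up after recognition, which is why they cannot change what the grammar accepts, only the shape of the resulting tree.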
Some patterns occur often in grammar construction - lists with delimiters, kv pairs etc. Even though the grammar is pretty simple, writing something like `Element & @*(@(!"," & Element))` over and over again is not really fun. Parse templates are designed to solve this issue.

A parse template is a function that will be executed to produce part of the pattern. In this example we generate a template rule for a comma-separated list of strings.
```nim
proc csvList(str: string): Patt[NoCategory, string] =
  andP(
    makeExpNoCat(str).tok(),
    zeroP(
      andP(
        makeExpNoCat(",").tok().addAction(taDrop),
        makeExpNoCat(str).tok()
      ).addAction(taSpliceDiscard)
    ).addAction(taSpliceDiscard)
  )

echo csvList("@").exprRepr()
echo eparse(@["@", ",", "@"], A ::= %csvList("@"))
```

```
{'@' & @*(@{!',' & '@'})}
+-> A
    +-> '@'
    +-> '@'
```
The DSL syntax is `%functionName(..<list-of-arguments>..)`. For codegen-based parsers (recursive `LL(1)` and `LL(*)`) the function MUST be executable at compile time. In all other cases grammar construction happens at runtime. In the example above an `LL(*)` parser was used.
A token has three generic parameters, referred to as `C`, `L` and `I` throughout the codebase.

- The first one is the 'category' of the token. It is expected (but not mandatory) to be an enum. Categories are usually things like punctuation, identifier, string/int literal, etc. If you don't need token categories, use the `NoCategory` enum.
- The second parameter is the 'lexeme'. It can be absolutely anything (`void` included). This field stores 'all other' information about the token - integer/string value for literals, for example.
- The last parameter is 'information'. Similar to the lexeme, but made for storing additional 'metainformation' about the token: position in source code, order in the original token stream etc. This information is NOT used in parsing.
For an example of custom token category/lexeme see manual.org.
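For illustration only, the three parameters map onto a simple generic record. This Python analogue (field names and the `Cat` enum are invented, not the library's type) shows how the pieces divide responsibilities:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Generic, TypeVar

C = TypeVar("C")  # category
L = TypeVar("L")  # lexeme
I = TypeVar("I")  # (meta)information

class Cat(Enum):
    # hypothetical category enum for the example
    IDENT = auto()
    INTLIT = auto()

@dataclass
class Token(Generic[C, L, I]):
    cat: C    # coarse token class; this is what drives parsing decisions
    lex: L    # payload: string/int value of the token etc.
    info: I   # position in source, stream order etc.; NOT used in parsing

tok = Token(Cat.IDENT, "hello", (1, 0))  # token at line 1, column 0
```

The split matters because the parser only ever dispatches on `cat` (and optionally a predicate over `lex`), so `info` can carry arbitrary bookkeeping without affecting recognition.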
A token is accepted if the lexeme predicate evaluates to 'true'. The predicate is placed in double square braces: `[[ expr ]]`. Depending on the syntax of the expression, different actions are performed.

- If it is an `Infix`, `Call` or `DotExpr` (e.g. `it in ["a", "B"]`, `startsWith(it, "..")`), the whole expression is wrapped into a predicate function `proc(it: L): bool {.noSideEffect.} = <your-expression>`.
- Otherwise it is passed to `makeExpTokenPredUsr(cat: C, val: <your-expression-type>)`.
```nim
import strutils, strformat

const defaultCategory = catNoCategory

func makeExpTokenPredUsr(
    cat: NoCategory, valset: bool
  ): ExpectedToken[NoCategory, string] =

  result = makeExpTokenPred[NoCategory, string](
    catNoCategory,  # Expected token category
    &"[{valset}]",  # string representation of expected token predicate
                    # (for pretty-printing)
    proc(str: string): bool = valset  # Construct predicate yourself
  )

initGrammarConst[NoCategory, string](grammar):
  A ::= *(B | C)
  B ::= [[ it.startsWith("@") ]]
  #        ^^^^^^^^^^^^^^^^^^
  #        |
  #        Copied to predicate directly
  C ::= [[ true ]] # Fallback nonterminal
  #        ^^^^
  #        |
  #        Passed to `makeExpTokenPredUsr`

let parser = newLLStarParser[NoCategory, string, void](grammar)
var stream = @["@ident", "#comment", "@ident"].makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
```
A large part of the design is described in devnotes.org; all functions and types are documented in the source code. If you have any additional questions feel free to join my discord server and ask them there.

I'm not an expert on parsing algorithms and related things, so I tried to design it in a way that would actually abstract things and make the API easy to understand.
Not supporting syntactic predicates allows the use of multiple parsing algorithms for the same grammar, ranging from the restrictive but fast `LL(1)` to something like an Earley parser.
The parser abstracts the notion of a token and is not tied to any lexer implementation - if you want, you can just split a string on spaces and call it a lexer. Or you can do some heuristics in the lexer and assign categories based on context. Or something else, I don't know.
The whole grammar is available as a value, which means it is possible to easily do all sorts of preprocessing and error detection (like use of undeclared nonterminals, left recursion detection and so on).
Tree actions and template rules provide a small, but hopefully useful, subset of syntactic actions. The advantage: it is possible to know exactly how the tree will look. Generating a statically typed case object for a grammar is possible.

The parser generator was originally intended to work in conjunction with a term rewriting system. You write the grammar in EBNF notation, dropping all cruft immediately (using splice-discard and drop rules), and then declaratively transform the tree into something else.
The parser generator is currently work-in-progress. All advertised features are implemented, but the number of supported algorithms is lacking - only backtracking `LL(*)` is fully supported. Codegen and table-driven `LL(1)` are partially supported (they have some weird bugs). Some work has been done on adding `SLR` and `Earley` parsers.
The parser generator has a relatively clean and documented internal API, designed to make implementation of new algorithms as simple as possible (most of the details are abstracted).
All sorts of contributions are welcome - issues, unit tests, documentation updates etc.
In addition there are several things that I wasn't able to implement myself. If you are interested in solving one of these problems it would be especially useful. If you have any questions about implementation details, the API etc., you can join my discord server.
WARNING: another issue I ran into - when actually running the compiled & linked library in a container, it just dies with the following, and I have no idea how to fix it.

```
__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:440
440     ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
```
Only recursive descent parsers can accept EBNF notation as-is. Every other one requires conversion from EBNF to BNF (implemented, tested). The problem is that this transformation changes the shape of the parse tree. For example `A ::= *(E)` is converted to `A ::= E1` and `E1 ::= Ɛ | E E1` - iteration is replaced with recursion.
```nim
const defaultCategory = catNoCategory

initGrammarConst[NoCategory, string](grammar):
  A ::= "hello" & *(B) & "world"
  B ::= "!!"

var toks = @["hello", "!!", "!!", "!!", "world"].makeTokens().makeStream()

let grammarVal = block:
  let tmp = grammar
  tmp.toGrammar()

echo "Original grammar"
echo grammarVal.exprRepr()
echo "---\n"

echo "Grammar converter to BNF"
echo grammarVal.toBNF().exprRepr()
echo "---\n"

echo "Recursive descent tree"
let parser1 = newLLStarParser[NoCategory, string, void](grammar)
let tree1 = parser1.parse(toks)
echo tree1.treeRepr()
echo "---\n"

toks.revertTo(0)

echo "Table-driven parser tree without structure fixup"
let parser2 = newLL1TableParser(
  grammarVal, dofixup = false, retainGenerated = true)
let tree2 = parser2.parse(toks)
echo tree2.treeRepr()
echo "---\n"

toks.revertTo(0)

echo "Table-driven parser tree with fixup"
let parser3 = newLL1TableParser(grammarVal, dofixup = true)
let tree3 = parser3.parse(toks)
echo tree3.treeRepr()
echo "---\n"
```
```
Original grammar
A ::= {'hello' & *(<B>) & 'world'}
B ::= '!!'
---

Grammar converter to BNF
A ::= .0 | 'hello' & <A0_1> & 'world'
B ::= .0 | '!!'
A0_1 ::= .0 | Ɛ
         .1 | <B> & <@A0_1>
---

Recursive descent tree
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree without structure fixup
+-> A
    +-> 'hello'
    +-> A0_1
    |   +-> B +-> '!!'
    |   +-> A0_1
    |       +-> B +-> '!!'
    |       +-> A0_1
    |           +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree with fixup
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---
```
Instead of `*(B)` a new rule `A0_1` is introduced, with two possible alternatives: either an empty production (`Ɛ`), or `B` followed by `A0_1` again. How this conversion affects the parse tree can be seen in the output: instead of a simple list of elements you get a deeply nested tree of `A0_1`. This is fixed automatically when converting the `EBNF` grammar to `BNF` by adding a 'splice' rule on every use of a newly generated pattern.
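The iteration-to-recursion rewrite and the splice fixup can be sketched in a few lines. This is an illustrative Python model of the idea (function names invented), not the library's converter: the recursive `A0_1`-style rule produces a right-nested tree, and splicing flattens it back into the list the EBNF form would have given directly:

```python
# BNF form of `A ::= *(B)`:  A0_1 ::= Ɛ | B A0_1
# Recognizing with the recursive rule nests one A0_1 node per element.
def parse_star_recursive(tokens):
    if not tokens:
        return ("A0_1",)  # Ɛ alternative: empty node
    return ("A0_1", tokens[0], parse_star_recursive(tokens[1:]))

# Splice fixup: collapse the nested A0_1 chain back into a flat list,
# recovering the shape the original *(B) tree would have had.
def splice(tree):
    if len(tree) == 1:
        return []  # Ɛ node contributes nothing
    return [tree[1]] + splice(tree[2])

t = parse_star_recursive(["!!", "!!", "!!"])
# t is right-nested, one A0_1 per '!!' plus a trailing Ɛ node
flat = splice(t)
```

Here `flat` is `["!!", "!!", "!!"]`, matching the flat `[ [ ... ] ]` list in the fixed-up table-driven output above.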
It kind of works (not really tested though), but I'm yet to figure out how to preserve the original tree actions. For example, when converting something like `@*(@{!',' & <Element>})` to BNF it gets flattened out, and it is not clear how to first splice things in `!',' & <Element>`, and then splice them again.
- [ ] support `<token-literal>` in grammar
- [ ] generate errors on unknown nonterminals used in a production
- [ ] unit tests for nimscript and js
- [ ] error reporting. Right now it is basically non-existent
Right now the parse tree is 'stringly typed' - nonterminal heads are described using `string` and all subnodes are placed in the same `subnodes: seq[ParseTree[...]]`.

The grammar DSL contains all necessary information to construct a case object with a selector enum, as well as to order all fields (the `LL(*)` parser uses the constant grammar to generate a set of mutually recursive functions). Tree actions could provide almost all necessary information for field types and ordering.
Possible mapping from grammar to constructed object:

- `Nterm ::= ...` -> `of ptrNterm: <fields>`
- `E1 & E2 & E3` -> `tuple[e1: <type-of-E1>, ... ]`
- `*E1` and `+E1` -> `seq[<type-of-E1>]`
- `?E1` -> `Optional[<type-of-E1>]`
- `E1 | E2` -> `case idx: [<number-of-alternatives>]`, and each alternative gets its own field. Case objects can be nested, so this is not a problem.
- `<token>` -> `tok: Token[...]`
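Applied to the nested-list grammar from earlier, the mapping could produce something like this hypothetical typed AST. The Python analogue below is for illustration only (all names invented; a `Union` stands in for the nested case object of the alternatives):

```python
from dataclasses import dataclass
from typing import Union

# Grammar being mapped (after tree actions strip brackets and commas):
#   List ::= !"[" & Elements & !"]"
#   Elements ::= Element & @*(@(!"," & Element))
#   Element ::= "i" | List

@dataclass
class NestedList:
    elements: list["Element"]   # *Element -> seq[<type-of-Element>]

@dataclass
class Element:
    alt: Union[str, NestedList]  # E1 | E2 -> one branch per alternative
                                 # (collapsed into a Union here for brevity)

# "[i,[i]]" would map to:
tree = NestedList([Element("i"), Element(NestedList([Element("i")]))])
```

The benefit over the stringly-typed tree is that shape errors (e.g. indexing a leaf) become type errors instead of runtime surprises, which is exactly what the case-object proposal above is after.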
There are several questions related to possible use cases, ease of use etc.

- [ ] Deterministic and intuitive names for fields.
- [ ] How should fields be named? It is not possible to have same-named fields in nim case objects.
Right now `ParseTree[C, L, I]` is hardcoded into all parsers - I don't think it will be enough for all use cases.

- It might be required to have a separate parse tree type defined for each grammar.
- Integration with `nimtrs` - construct a term instead of a parse tree, and maybe run rewriting actions immediately.
I'm like, 40% sure that I'm not sure about what it is, but it looked nice when I saw it last time. It is related to prolog, and nimtrs already implements large portions (no clauses and backtracking, but full support for unification and all auxiliary functions for working with terms and environments).
The DSL for this library uses hmisc/hexceptions to generate much better compilation errors in case of a malformed DSL.
```nim
let tree = "h".exampleParse:
  A ::= !@*("h")
echo tree.treeRepr()
```

```
Unexpected prefix: '!@*'

 2    let tree = "h".exampleParse:
 5:8      A ::= !@*("h")
                ^^^
                |
                Incorrect prefix combination

Raised in grammar_dsl.nim:112 [CodeError:ObjectType]
```
NOTE: the output is not colored in the readme (because github fails to support this basic feature since 2014), but it is colored by default in the terminal (controlled by the `-d:plainStdout` compilation flag).