typelevel/cats-parsePublic

NotificationsYou must be signed in to change notification settings
Fork51
Star245

A parsing library for the cats ecosystem

License

MIT license

245 stars 51 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 525 Commits
.github		.github
bench/src/main		bench/src/main
core		core
docs		docs
project		project
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitignore		.gitignore
.scalafmt.conf		.scalafmt.conf
LICENSE		LICENSE
PULL_REQUEST_TEMPLATE.md		PULL_REQUEST_TEMPLATE.md
README.md		README.md
Release.md		Release.md
build.sbt		build.sbt

Repository files navigation

cats-parse

A parsing library for the cats ecosystem.

To use in sbt add, the following to yourlibraryDependencies:

// use this snippet for the JVMlibraryDependencies+="org.typelevel"%%"cats-parse"%"1.0.0"// use this snippet for JS, or cross-buildinglibraryDependencies+="org.typelevel"%%%"cats-parse"%"1.0.0"

TheAPI docs are published.

Why another parsing library? See thisblog post detailing thedesign. To reiterate,this library has a few goals:

Compatibility: should work on all scala platforms and recent versions. Currently it supports JVM, JS on versions 2.11, 2.12, 2.13, and 3. The core library should have minimal dependencies. Currently this library only depends on cats.
Excellent performance: should be as fast or faster than any parser combinator that has comparable scala version support.
Cats friendliness: method names match cats style, and out of the box support for cats typeclasses.
Precise errors: following theHaskell Trifecta parsing library, backtracking is opt-in vs opt-out. This design tends to make it easier to write parsers that point correctly to failure points.
Safety: by separating Parser0, a parser that may consume no input, from Parser, a parser must consume at least one character on success. Most combinators and methods can be made safer to use and less prone to runtime errors.
Stability: we are very reluctant to break compatibility between versions. We want to put a minimal tax on users to stay on the latest versions.

Tutorial

Simple parser

The library provides a set of simple parsers which might be combined to create any parsing logic. The simplest parser isParser.anyChar which is successful where there is one char at the input. It has typeParser[Char] which means it returns one parsed char.

To provide any input to parser one need to useparse method.

importcats.parse.Parservalp:Parser[Char]=Parser.anyCharp.parse("t")p.parse("")p.parse("two")

Notice the return type.Tuple2[String, Char] contains the rest of the input string and one parsed char if parsing was successful. It returnsLeft with error message if there was some parsing error.

Mapping output

The output of the parser might be processed withmap method:

importcats.parse.ParsercaseclassCharWrapper(value:Char)valp:Parser[CharWrapper]=Parser.anyChar.map(char=>CharWrapper(char))p.parse("t")

There are built-in methods for mapping the output to typesString orUnit:

importcats.parse.Rfc5234.digitimportcats.parse.Parser/* String*/valp2:Parser[String]= digit.map((c:Char)=> c.toString)// is analog tovalp3:Parser[String]= digit.stringp3.parse("1")/* Unit*/valp4:Parser[Unit]= digit.map(_=> ())// is analog tovalp5:Parser[Unit]= digit.voidp5.parse("1")

Combining parsers

The parsers might be combined through operators:

~ - product. Allows continuing parsing if the left side was successful;
<*,*> - productL and productR. Works just like product but drop part of result;
surroundedBy - identical toborder *> parsingResult <* border;
between - identical toborder1 *> parsingResult <* border2;
|,orElse. Parser will be successful if any of sides is successful.

For this example we'll be usingcats.parse.Rfc5234 package which contains such parsers asalpha (Latin alphabet) andsp (whitespace).

importcats.parse.Rfc5234.{sp,alpha,digit}importcats.parse.Parser/* Product*/// the sp parser won't return the whitespace, it just returns Unit if it successfulvalp1:Parser[(Char,Unit)]= alpha~ spp1.parse("t")p1.parse("t")/* productL, productR*/// The type is just Char because we dropping the space// to drop the alphabet change the arrow side: alpha *> spvalp2:Parser[Char]= alpha<* sp// still error since we need the space even if we drop itp2.parse("t")p2.parse("t")/* surroundedBy*/valp4:Parser[Char]= sp*> alpha<* spvalp5:Parser[Char]= alpha.surroundedBy(sp)p4.parse(" a")p5.parse(" a")/* between*/valp6:Parser[Char]= sp*> alpha<* digitvalp7:Parser[Char]= alpha.between(sp, digit)p6.parse(" a1")p7.parse(" a1")/* OrElse*/valp3:Parser[AnyVal]= alpha| spp3.parse("t")p3.parse("")

Repeating parsers

Sometimes we need something to repeat zero or more times. The cats-parse haverep andrep0 methods for repeating values.rep means that the parser must be successfulat least one time.rep0 means that the parser output might be empty.

importcats.data.NonEmptyListimportcats.parse.Rfc5234.alphaimportcats.parse.{Parser,Parser0}valp1:Parser[NonEmptyList[Char]]= alpha.repvalp2:Parser0[List[Char]]= alpha.rep0p1.parse("")p2.parse("")p2.parse("something")

Notice the types of parsers.Parser type always means some non-empty output and the output ofParser0 might be empty.

One common task in this example is to parse a full line (or words) of text. In the example it is done byrep, and then it could be mapped toString in different ways:

importcats.data.NonEmptyListimportcats.parse.Rfc5234.alphaimportcats.parse.Parservalp:Parser[String]= alpha.rep.map((l:NonEmptyList[Char])=> l.toList.mkString)valp2:Parser[String]= alpha.rep.stringvalp3:Parser[String]= alpha.repAs[String]

All three parsers will be identical in parsing results, butp2 andp3 are using built-in methods which will not create intermediate list.rep +map creates intermediate list which is mapped to string in this example.

Parsers with empty output

Some parsers never return a value. They have a typeParser0. One might get this type of parser when usingrep0 or.? methods.

importcats.parse.Rfc5234.{alpha,sp}importcats.parse.Parservalp:Parser[String]= (alpha.rep<* sp.?).rep.stringp.parse("hello world")

Notice the type we got -Parser[String]. That is because we haverep outside and ouralpha.rep parser withParser type is on the left side of the clause. But what if we want to parse strings with spaces at the beginning?

val p = (sp.? *> alpha.rep <* sp.?).rep.string

We will get an errorvalue rep is not a member of cats.parse.Parser0. This happens since we have the left-side parser as optional insp.? *> alpha.rep <* sp.? clause. This clause has a typeParser0 which can't be repeated.

But this parser can't be empty because ofalpha.rep parser, and we know it. For these types of parsers we need to usewith1 wrapper method on theleft side of the clause:

importcats.parse.Rfc5234.{alpha,sp}importcats.parse.Parservalp:Parser[String]= (sp.?.with1*> alpha.rep<* sp.?).rep.stringp.parse("hello world")p.parse(" hello world")

If we have multipleParser0 parsers before theParser - we'd need to use parenthesis like this:(sp.? ~ sp.?).with1 *> alpha.rep.

Error handling

Parser might be interrupted by parsing error. There are two kinds of errors:

an error that has consumed 0 characters (epsilon failure);
an error that has consumed 1 or more characters (arresting failure) (sometimes called halting failure).

importcats.parse.Rfc5234.{alpha,sp}importcats.parse.Parservalp1:Parser[Char]= alphavalp2:Parser[Char]= sp*> alpha// epsilon failurep1.parse("123")// arresting failurep2.parse(" 1")

We need to make this difference because the first type of error allows us to say that parser is not matching the input before we started to process it and the second error happens while parser processing the input.

Backtrack

Backtrack allows us to convert anarresting failure toepsilon failure. It also rewinds the input to the offset to that used before parsing began. The resulting parser might still be combined with others. Let's look at the example:

importcats.parse.Rfc5234.{digit,sp}valp= sp*> digit<* spp.parse(" 1")

Parser.Error contains two parameters:

finalcaseclassError(input:String,failedAtOffset:Int,expected:NonEmptyList[Expectation])caseclassInRange(offset:Int,lower:Char,upper:Char)extendsExpectation

In the error message we see the failed offset and the expected value. There is a lot of expected error types which can be found in source code.

One thing we can do in this situation is providing a fallback parser which can be used in case of error. We can do this by usingbacktrack (which rewinds the input, so it will be passed to fallback parser as it was before the error) and combining it withorElse operator:

importcats.parse.Rfc5234.{digit,sp}valp1= sp*> digit<* spvalp2= sp*> digitp1.backtrack.orElse(p2).parse(" 1")(p1.backtrack| p2 ).parse(" 1")

Notice that(p1.backtrack | p2) clause is another parser by itself since we're still combining parsers by usingorElse.

But we've already usedorElse in example before without anybacktrack operator, and it worked just fine. Why do we needbacktrack now? Let's look at this example:

importcats.parse.Rfc5234.{digit,sp}valp1= sp*> digit<* spvalp2= sp*> digitvalp3= digit(p1| p2).parse(" 1")(p1| p2| p3).parse("1")

The first parser combination is interrupted byarresting failures and the second parsing combination will only suffer fromepsilon failures. The second parser works becauseorElse and| operators actually allows recovering from epsilon failures, but not from arresting failures.

So thebacktrack helps us where theleft side returns arresting failure.

Soft

This method might look similar tobacktrack, but it allows us toproceed the parsing when theright side is returning an epsilon failure. It is really useful for ambiguous parsers when we can't really tell what exactly we are parsing before the end. Let's say we want to parse some input to the search engine which contains fields. This might look like "field:search_query". Let's try to write a parser for this:

importcats.parse.Rfc5234.{alpha,sp}importcats.parse.Parserimportcats.parse.Parser.{char=>pchar}valsearchWord= alpha.rep.stringvalfieldValue= alpha.rep.string~ pchar(':')valp1= fieldValue.?~ (searchWord~ sp.?).rep.stringp1.parse("title:The Wind Has Risen")p1.parse("The Wind Has Risen")

This error happens because we can't really tell if we are parsing thefieldValue before we met a: char. We might do this with by writing two parsers, converting the first one's failure to epsilon failure bybacktrack and then providing fallback parser by| operator (which allows the epsilon failures):

valp2= fieldValue.?~ (searchWord~ sp.?).rep.stringvalp3= (searchWord~ sp.?).rep.string(p2.backtrack| p3).parse("title:The Wind Has Risen")(p2.backtrack| p3).parse("The Wind Has Risen")

But this problem might be resolved withsoft method inside the first parser since the right side of it actually returns an epsilon failure itself:

valfieldValueSoft= alpha.rep.string.soft~ pchar(':')valp4= fieldValueSoft.?~ (searchWord~ sp.?).rep.stringp4.parse("title:The Wind Has Risen")p4.parse("The Wind Has Risen")

So when theright side returns an epsilon failure thesoft method allows us to rewind parsed input and try to proceed it's parsing with next parsers (without changing the parser itself!).

JSON parser example

Below is most of a json parser (the string unescaping is elided). This example can give you a feelfor what it is like to use this library.

importcats.parse.strings.Json.delimited.{parser=>jsonString}

importcats.parse.{Parser0,Parser=>P,Numbers}importorg.typelevel.jawn.ast._objectJson {private[this]valwhitespace:P[Unit]=P.charIn("\t\r\n").voidprivate[this]valwhitespaces0:Parser0[Unit]= whitespace.rep0.voidvalparser:P[JValue]=P.recursive[JValue] { recurse=>valpnull=P.string("null").as(JNull)valbool=P.string("true").as(JBool.True).orElse(P.string("false").as(JBool.False))valstr= jsonString.map(JString(_))valnum=Numbers.jsonNumber.map(JNum(_))vallistSep:P[Unit]=P.char(',').soft.surroundedBy(whitespaces0).voiddefrep[A](pa:P[A]):Parser0[List[A]]=      pa.repSep0(listSep).surroundedBy(whitespaces0)vallist= rep(recurse).with1      .between(P.char('['),P.char(']'))      .map { vs=>JArray.fromSeq(vs) }valkv:P[(String,JValue)]=      jsonString~ (P.char(':').surroundedBy(whitespaces0)*> recurse)valobj= rep(kv).with1      .between(P.char('{'),P.char('}'))      .map { vs=>JObject.fromSeq(vs) }P.oneOf(str:: num:: list:: obj:: bool:: pnull::Nil)  }// any whitespace followed by json followed by whitespace followed by endvalparserFile:P[JValue]= whitespaces0.with1*> parser<* (whitespaces0~P.end)}

Performance

We have a benchmark suite that compares JSON parsing across several commonly used libraries. Arecent (2021/11/05) result is below:

[info] Benchmark                         Mode  Cnt    Score    Error  Units[info] BarBench.catsParseParse           avgt    4   ≈ 10⁻⁴           ms/op[info] BarBench.fastparseParse           avgt    4   ≈ 10⁻⁴           ms/op[info] BarBench.jawnParse                avgt    4   ≈ 10⁻⁴           ms/op[info] BarBench.parboiled2Parse          avgt    4   ≈ 10⁻⁴           ms/op[info] BarBench.parsleyParseCold         avgt    4    0.064 ±  0.001  ms/op[info] Bla25Bench.catsParseParse         avgt    4   23.095 ±  0.174  ms/op[info] Bla25Bench.fastparseParse         avgt    4   15.622 ±  0.414  ms/op[info] Bla25Bench.jawnParse              avgt    4    7.501 ±  0.143  ms/op[info] Bla25Bench.parboiled2Parse        avgt    4   18.423 ±  6.094  ms/op[info] Bla25Bench.parsleyParseCold       avgt    4   30.752 ±  0.279  ms/op[info] CountriesBench.catsParseParse     avgt    4    7.169 ±  0.041  ms/op[info] CountriesBench.fastparseParse     avgt    4    5.023 ±  0.023  ms/op[info] CountriesBench.jawnParse          avgt    4    1.235 ±  0.011  ms/op[info] CountriesBench.parboiled2Parse    avgt    4    2.936 ±  0.008  ms/op[info] CountriesBench.parsleyParseCold   avgt    4   11.800 ±  0.162  ms/op[info] Qux2Bench.catsParseParse          avgt    4    7.031 ±  0.599  ms/op[info] Qux2Bench.fastparseParse          avgt    4    6.597 ±  0.031  ms/op[info] Qux2Bench.jawnParse               avgt    4    2.227 ±  0.014  ms/op[info] Qux2Bench.parboiled2Parse         avgt    4    5.514 ±  0.472  ms/op[info] Qux2Bench.parsleyParseCold        avgt    4   10.327 ±  0.293  ms/op[info] StringInBenchmarks.oneOfParse     avgt    4   88.105 ±  2.658  ns/op[info] StringInBenchmarks.stringInParse  avgt    4  129.246 ±  1.820  ns/op[info] Ugh10kBench.catsParseParse        avgt    4   53.679 ±  1.385  ms/op[info] Ugh10kBench.fastparseParse        avgt    4   45.165 ±  0.356  ms/op[info] Ugh10kBench.jawnParse             avgt    4   11.404 ±  0.068  ms/op[info] Ugh10kBench.parboiled2Parse       avgt    4   31.984 ±  0.748  ms/op[info] Ugh10kBench.parsleyParseCold      avgt    4   77.150 ±  1.093  ms/op

Note that parboiled and fastparse both use macros that make them very difficult to port to Dotty.Jawn is a specialized and optimized JSON parser, so that can be considered an upper bound onperformance.Keep in mind that parser performance depends both on the parsing library but also how the parseris written, but these results suggest that this library is already quite competitive.

Migrating from Fastparse

You should find all the Fastparse methods you are used to. If not, feel free to open an issue.There are a few things to keep in mind:

In fastparse, you wrap a parser inP(...) to make the interior lazy. Following cats, to get a lazily constructed parser useParser.defer orcats.Defer[Parser].defer.
In fastparse the~ operator does tuple concatenation. This can be nice, but also complex to see what the resulting type is. In cats-parse,~ always returns a Tuple2 containing the parsed values from the left and right. To recover fastparse-like behavior, use cats syntax(pa, pb, pc...).tupled.
In fastparse, backtracking is opt-out by using cuts. In cats-parse, backtracking is opt-in using.backtrack. Put another way, normal product operations in cats-parse are like~/ in fastparse.
In cats-parse, using*>,<*, and.void methods can be a significant optimization: if you don't need a result, communicate that to the library with those methods.

Getting and Giving Help

We welcome new contributors and new maintainers. Please feel free to open issues and PRs. If you have anyproblem using the library, an issue is the best way to ask a question until we flush out moredocumentation.