lettier/parsing-with-haskell-parser-combinatorsPublic

NotificationsYou must be signed in to change notification settings
Fork3
Star94

🔍 A step-by-step guide to parsing using Haskell parser combinators.

lettier.github.io/parsing-with-haskell-parser-combinators/

License

BSD-3-Clause license

94 stars 3 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
docs		docs
src		src
test-input		test-input
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Setup.hs		Setup.hs
parsing-with-haskell-parser-combinators.cabal		parsing-with-haskell-parser-combinators.cabal
stack.yaml		stack.yaml
stack.yaml.lock		stack.yaml.lock

Repository files navigation

Parsing With Haskell Parser Combinators

Need to parse something?Never heard of a "parser combinator"?Looking to learn some Haskell?Awesome!Below is everything you'll need to get up and parsing with Haskell parser combinators.From here you can try tackling esoteric data serialization formats,compiler front ends,domain specific languages—you name it!

Building The Demos

Included with this guide are two demo programs.

version-number-parser parses a file for a version number.srt-file-parser parses a file for SRT subtitles.Feel free to try them out with the files found intest-input/.

Stack

Download the Haskell toolStackand then run the following.

git clone https://github.com/lettier/parsing-with-haskell-parser-combinatorscd parsing-with-haskell-parser-combinatorsstack build

Cabal

If using Cabal, you can run the following.

git clone https://github.com/lettier/parsing-with-haskell-parser-combinatorscd parsing-with-haskell-parser-combinatorscabal sandbox initcabal --require-sandbox buildcabal --require-sandbox install

Running The Demos

After building the two demo programs, you can run them like so.

Stack

To try the version number parser, run the following.

cd parsing-with-haskell-parser-combinatorsstackexec -- version-number-parserWhat is the version output file path?test-input/gifcurry-version-output.txt

To try the SRT file parser, run the following.

cd parsing-with-haskell-parser-combinatorsstackexec -- srt-file-parserWhat is the SRT file path?test-input/subtitles.srt

Cabal

To try the version number parser, run the following.

cd parsing-with-haskell-parser-combinators.cabal-sandbox/bin/version-number-parserWhat is the version output file path?test-input/gifcurry-version-output.txt

To try the SRT file parser, run the following.

cd parsing-with-haskell-parser-combinators.cabal-sandbox/bin/srt-file-parserWhat is the SRT file path?test-input/subtitles.srt

Parser Combinator

One of the better ways to learn about the parsing strategy,parser combinator,is to look at an implementation of one.

Parsers built using combinators are straightforward to construct, readable, modular, well-structured, and easily maintainable.

^{—Parser combinator - Wikipedia}

ReadP

Let's take a look under the hood ofReadP,a parser combinator library found in base.Since it is in base, you should already have it.

💡 Note, you may want to try outParsec after getting familiar with ReadP.It too is a parser combinator library that others prefer to ReadP.As an added bonus, it is included inGHC's boot librariesas of GHC version 8.4.1.

P Data Type

-- (c) The University of Glasgow 2002dataPa=Get (Char->Pa)  |Look (String->Pa)  |Fail  |Resulta (Pa)  |Final [(a,String)]derivingFunctor

We'll start with theP data type.Thea inP a is up to you (the library user) and can be whatever you'd like.The compiler creates a functor instance automatically and there are hand-written instances forapplicative,monad,MonadFail,and alternative.

💡 Note, for more on functors, applicatives, and monads, checkoutYour easy guide to Monads, Applicatives, & Functors.

P is asum type with five cases.

Get consumes a single character from the input string and returns a newP.
Look accepts a duplicate of the input string and returns a newP.
Fail indicates the parser finished without a result.
Result holds a possible parsing and anotherP case.
Final is a list of two-tuples. The first tuple element is a possible parsing of the inputand the second tuple element is the rest of the input string that wasn't consumed byGet.

Run

-- (c) The University of Glasgow 2002run::Pa->ReadSarun (Get f)      (c:s)= run (f c) srun (Look f)     s= run (f s) srun (Result x p) s= (x,s): run p srun (Final r)    _= rrun _            _=[]

run is the heart of the ReadP parser.It does all of the heavy lifting as it recursively runs through all of the parser states that we saw up above.You can see that it takes aP and returns aReadS.

-- (c) The University of Glasgow 2002typeReadSa=String-> [(a,String)]

ReadS a is a type alias forString -> [(a,String)].So whenever you seeReadS a, thinkString -> [(a,String)].

-- (c) The University of Glasgow 2002run::Pa->String-> [(a,String)]run (Get f)      (c:s)= run (f c) srun (Look f)     s= run (f s) srun (Result x p) s= (x,s): run p srun (Final r)    _= rrun _            _=[]

run pattern matches the different cases ofP.

If it'sGet,it calls itself with a newP (returned by passing the functionf, inGet f, the next characterc in the input string)and the rest of the input strings.
If it'sLook,it calls itself with a newP (returned by passing the functionf, inLook f, the input strings)and the input string.Notice howLook doesn't consume any characters from the input string likeGet does.
If it'sResult,it assembles a two-tuple—containing the parsed result and what's left of the input string—andprepends this to the result of a recursive call that runs with anotherP case and the input string.
If it'sFinal,run returns a list of two-tuples containing parsed results and input string leftovers.
For anything else,run returns an empty list.For example, if the case isFail,run will return an empty list.

> run (Get (\ a->Get (\ b->Result [a,b]Fail)))"12345"[("12","345")]

ReadP doesn't exposerun but if it did, you could call it like this.The twoGets consume the'1' and'2', leaving the"345" behind.

> run (Get (\ a->Get (\ b->Result [a,b]Fail)))"12345"> run (Get (\ b->Result ['1',b]Fail))"2345"> run (Result ['1','2']Fail)"345"> (['1','2'],"345"): run (Fail)"345"> (['1','2'],"345"):[][("12","345")]

Running through each recursive call, you can see how we arrived at the final result.

> run (Get (\ a->Get (\ b->Result [a,b] (Final [(['a','b'],"c")]))))"12345"[("12","345"),("ab","c")]

UsingFinal, you can include a parsed result in the final list of two-tuples.

readP_to_S

-- (c) The University of Glasgow 2002readP_to_S::ReadPa->ReadSa-- readP_to_S :: ReadP a -> String -> [(a,String)]   readP_to_S (R f)= run (freturn)

While ReadP doesn't exposerun directly, it does expose it viareadP_to_S.readP_to_S introduces anewtype calledReadP.readP_to_S accepts aReadP a, a string, and returns a list of two-tuples.

ReadP Newtype

-- (c) The University of Glasgow 2002newtypeReadPa=R (forallb. (a->Pb)->Pb)

Here's the definition ofReadP a.There are instances for functor, applicative, monad,MonadFail, alternative, andMonadPlus.TheR constructor takes a function that takes another function and returns aP.The accepted function takes whatever you chose fora and returns aP.

-- (c) The University of Glasgow 2002readP_to_S (R f)= run (freturn)

Recall thatP is a monad andreturn's type isa -> m a.Sof is the(a -> P b) -> Pb function andreturn is the(a -> P b) function.Ultimately,run gets theP b it expects.

-- (c) The University of Glasgow 2002readP_to_S (R f) inputString= run (freturn) inputString--^^^^^^^^^^^                  ^^^^^^^^^^^

It's left off in the source code but remember thatreadP_to_S andrun expects an input string.

-- (c) The University of Glasgow 2002instanceFunctorReadPwherefmap h (R f)=R (\k-> f (k. h))

Here's the functor instance definition forReadP.

> readP_to_S (fmap toLower get)"ABC"[('a',"BC")]> readP_to_S (toLower<$> get)"ABC"[('a',"BC")]

This allows us to do something like this.fmap functor mapstoLower over the functorget which equalsR Get.Recall that the type ofGet is(Char -> P a) -> P a which theReadP constructor (R) accepts.

-- (c) The University of Glasgow 2002fmap h       (R f  )=R (\ k-> f   (k. h      ))fmap toLower (RGet)=R (\ k->Get (k. toLower))

Here you see the functor definition rewritten for thefmap toLower get example.

Applicative P Instance

Looking up above, how didreadP_to_S return[('a',"BC")] when we only usedGet which doesn't terminaterun?The answer lies in the applicative definition forP.

-- (c) The University of Glasgow 2002instanceApplicativePwherepure x=Result xFail(<*>)= ap

return equalspure so we could rewritereadP_to_S (R f) = run (f return) to bereadP_to_S (R f) = run (f pure).By usingreturn or ratherpure,readP_to_S setsResult x Fail as the final caserun will encounter.If reached,run will terminate and we'll get our list of parsings.

> readP_to_S (fmap toLower get)"ABC"-- Use the functor instance to transform fmap toLower get.> readP_to_S (R (\ k->Get (k. toLower)))"ABC"-- Call run which removes R.> run ((\ k->Get (k. toLower))pure)"ABC"-- Call function with pure to get rid of k.> run (Get (pure. toLower))"ABC"-- Call run for Get case to get rid of Get.> run ((pure. toLower)'A')"BC"-- Call toLower with 'A' to get rid of toLower.> run (pure'a')"BC"-- Use the applicative instance to transform pure 'a'.> run (Result'a'Fail)"BC"-- Call run for the Result case to get rid of Result.> ('a',"BC"): run (Fail)"BC"-- Call run for the Fail case to get rid of Fail.> ('a',"BC"):[]-- Prepend.[('a',"BC")]

Here you see the flow fromreadP_to_S to the parsed result.

Alternative P Instance

-- (c) The University of Glasgow 2002instanceAlternativePwhere-- ...-- most common case: two gets are combinedGet f1<|>Get f2=Get (\c-> f1 c<|> f2 c)-- results are delivered as soon as possibleResult x p<|> q=Result x (p<|> q)  p<|>Result x q=Result x (p<|> q)-- ...

TheAlternative instance forP allows us to split the flow of the parser into a left and right path.This comes in handy when the input can go none, one, or (more rarely) two of two ways.

> readP_to_S ((get>>=\ a->return a)<|> (get>> get>>=\ b->return b))"ABC"[('A',"BC"),('B',"C")]

The<|> operator or function introduces a fork in the parser's flow.The parser will travel through both the left and right paths.The end result will contain all of the possible parsings that went leftand all of the possible parsings that went right.If both paths fail, then the whole parser fails.

💡 Note, in other parser combinator implementations,when using the<|> operator,the parser will go left or right but not both.If the left succeeds, the right is ignored.The right is only processed if the left side fails.

> readP_to_S ((get>>=\ a->return [a])<|> look<|> (get>> get>>=\a->return [a]))"ABC"[("ABC","ABC"),("A","BC"),("B","C")]

You can chain the<|> operator for however many options or alternatives there are.The parser will return a possible parsing involving each.

ReadP Failure

-- (c) The University of Glasgow 2002instanceMonadReadPwherefail _=R (\_->Fail)R m>>= f=R (\k-> m (\a->letR m'= f ain m' k))

Here is theReadP monad instance.Notice the definition forfail.

> readP_to_S ((\ a b c-> [a,b,c])<$> get<*> get<*> get)"ABC"[("ABC","")]> readP_to_S ((\ a b c-> [a,b,c])<$> get<*>fail""<*> get)"ABC"[]> readP_to_S (get>>=\ a-> get>>=\ b-> get>>=\ c->return [a,b,c])"ABC"[("ABC","")]> readP_to_S (get>>=\ a-> get>>=\ b->fail"">>=\ c->return [a,b,c])"ABC"[]

You can cause an entire parser path to abort by callingfail.Since ReadP doesn't provide a direct way to generate aResult orFinal case,the return value will be an empty list.If the failed path is the only path, then the entire result will be an empty list.Recall that whenrun matchesFail, it returns an empty list.

-- (c) The University of Glasgow 2002instanceAlternativePwhere-- ...-- fail disappearsFail<|> p= p  p<|>Fail= p-- ...

Going back to the alternativeP instance,you can see how a failure on either side (but not both) will not fail the whole parser.

> readP_to_S (get>>=\ a-> get>>=\ b-> pfail>>=\ c->return [a,b,c])"ABC"[]

Instead of usingfail, ReadP providespfail which allows you to generate aFail case directly.

Version Number

Gifcurry,the Haskell-built video editor for GIF makers, shells out to various different programs.To ensure compatibility, it needs the version number for each of the programs it shells out to.One of those programs is ImageMagick.

Version: ImageMagick 6.9.10-14 Q16 x86_64 2018-10-24 https://imagemagick.orgCopyright: © 1999-2018 ImageMagick Studio LLCLicense: https://imagemagick.org/script/license.phpFeatures: Cipher DPC HDRI Modules OpenCL OpenMP

Here you see the output ofconvert --version.How could you parse this to capture the 6, 9, 10, and 14?

Looking at the output,we know the version number is a collection of numbers separated by either a period or a dash.This definition covers the dates as well so we'll make sure that the first two numbers are separated by a period.That way, if they put a date before the version number, we won't get the wrong result.

1. Consume zero or more characters that are not 0 through 9 and go to 2.2. Consume zero or more characters that are 0 through 9, save this number, and go to 3.3. Look at the rest of the input and go to 4.4. If the input    - is empty, go to 6.    - starts with a period, go to 1.    - starts with a dash        - and you have exactly one number, go to 5.        - and you have more than one number, go to 1.    - doesn't start with a period or dash        - and you have exactly one number, go to 5.        - you have more than one number, go to 6.5. Delete any saved numbers and go to 1.6. Return the numbers found.

Before we dive into the code, here's the algorithm we'll be following.

Building The Version Number Parser

parseVersionNumber::  [String]->ReadP [String]parseVersionNumber  nums=do  _<- parseNotNumber  num<- parseNumberlet nums'= nums++ [num]  parseSeparator nums' parseVersionNumber

parseVersionNumber is the main parser combinator that parses an input string for a version number.It accepts a list of strings and returns a list of strings in the context of theReadP data type.The accepted list of strings is not the input that gets parsed but rather the list of numbers found so far.For the first function call, the list is empty since it hasn't parsed anything yet.

parseVersionNumber  nums

Starting from the top,parseVersionNumber takes a list of strings which are the current list of numbers found so far.

  _<- parseNotNumber

parseNotNumber consumes everything that isn't a number from the input string.Since we are not interested in the result, we discard it (_ <-).

  num<- parseNumberlet nums'= nums++ [num]

Next we consume everything that is a number and then add that to the list of numbers found so far.

  parseSeparator nums' parseVersionNumber

AfterparseVersionNumber has processed the next number, it passes the list of numbers found and itself toparseSeparator.

Parsing The Separator

parseSeparator::  [String]->  ([String]->ReadP [String])->ReadP [String]parseSeparator  nums  f=do  next<- lookcase nextof""->return nums    (c:_)->case cof'.'-> f nums'-'->iflength nums==1then f[]else f nums        _->iflength nums==1then f[]elsereturn nums

Here you seeparseSeparator.

  next<- lookcase nextof""->return nums    (c:_)->

look allows us to get what's left of the input string without consuming it.If there's nothing left, it returns the numbers found.However, if there is something left, it analyzes the first character.

case cof'.'-> f nums'-'->iflength nums==1then f[]else f nums        _->iflength nums==1then f[]elsereturn nums

If the next character is a period, callparseVersionNumber again with the current list of numbers found.If it's a dash and we have exactly one number, callparseVersionNumber with an empty list of numbers since it's a date.If it's a dash and we don't have exactly one number, callparseVersionNumber with the list of numbers found so far.Otherwise,callparseVersionNumber with an empty list if we have exactly one numberor return the numbers found if we don't have exactly one number.

Parsing Non-numbers

parseNotNumber::ReadPStringparseNotNumber=  munch (not. isNumber)

parseNotNumber usesmunch whichReadP provides.munch is given the predicate(not . isNumber) which returns true for any character that isn't 0 through 9.

munch:: (Char->Bool)->ReadPString

munch continuously callsget if the next character in the input string satisfies the predicate.If it doesn't,munch returns the characters that did, if any.Since it only usesget, munch always succeeds.

💡 Note,parseNumber is similar toparseNotNumber.Instead ofnot . isNumber, the predicate is justisNumber.

Munch Versus Many

parseNotNumber'::ReadPStringparseNotNumber'=  many (satisfy (not. isNumber))

Instead of usingmunch,you could writeparseNotNumber like this,usingmany andsatisfy—both of which ReadP provides.Looking at the type signature formany, it accepts a single parser combinator (ReadP a).In this instance, it's being given the parser combinatorsatisfy.

> readP_to_S (satisfy (not. isNumber))"a"[('a',"")]> readP_to_S (satisfy (not. isNumber))"1"[]

satisfy takes a predicate and usesget to consume the next character.If the accepted predicate returns true,satisfy returns the character.Otherwise,satisfy callspfail and fails.

> readP_to_S (munch (not. isNumber))"abc123"[("abc","123")]> readP_to_S (many (satisfy (not. isNumber)))"abc123"[("","abc123"),("a","bc123"),("ab","c123"),("abc","123")]

Usingmany can give you unwanted results.Ultimately,many introduces one or moreResult cases.Because of this,many always succeeds.

> readP_to_S (many look)"abc123"-- Runs forever.

many will run your parser until it fails or runs out of input.If your parser never fails or never runs out of input,many will never return.

> readP_to_S (many (get>>=\ a->return (read (a:"")::Int)))"12345"[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]

For every index in the result,the parsed result will be the outcome of having ran the parser index times on the entire input.

>let parser= get>>=\ a->return (read (a:"")::Int)>let many' results=return results<|> (parser>>=\ result-> many' (results++ [result]))> readP_to_S (many'[])"12345"[([],"12345"),([1],"2345"),([1,2],"345"),([1,2,3],"45"),([1,2,3,4],"5"),([1,2,3,4,5],"")]

Here's an alternate definition formany.On the left side of<|>,it returns the current parser results.On the right side of<|>,it runs the parser,adds that result to the current parser results,and calls itself with the updated results.This has a cumulative sum type effect where indexi is the parser result appended to the parser result ati - 1,i - 2,...,and1.

Running The Version Number Parser

Now that we built the parser, let's run it.

>let inputString=>"Some Program (C) 1234-56-78 All rights reserved.\n\>     \Version: 12.345.6-7\n\>     \License: Some open source license."> readP_to_S (parseVersionNumber[]) inputString[(["12","345","6","7"],"\nLicense: Some open source license.")]

You can see it extracted the version number correctly even with the date coming before it.

SRT

Now let's parse something more complicated—SRT files.

For the release ofGifcurrysix, I needed to parseSRT (SubRip Text) files.SRT files contain subtitles that video processing programs use to display text on top of a video.Typically this text is the dialog of a movie translated into various different languages.By keeping the text separate from the video,there only needs to be one video which saves time, storage space, and bandwidth.The video software can swap out the text without having to swap out the video.Contrast this with burning-in or hard-coding the subtitles where the text becomes a part of the image data that makes up the video.In this case, you would need a video for each collection of subtitles.

^{Inner Video © Blender Foundation |www.sintel.org}

Gifcurry can take a SRT file and burn-in the subtitles for the video slice your select.

700:02:09,400 --> 00:02:13,800What brings you tothe land of the gatekeepers?800:02:15,000 --> 00:02:17,500I'm searching for someone.900:02:18,000 --> 00:02:22,200Someone very dear?A kindred spirit?

Here you see the English subtitles forSintel (© Blender Foundation |www.sintel.org).

SRT Format

SRT is perhaps the most basic of all subtitle formats.

^{—SRT Subtitle | Matrosk}

The SRT file format consists of blocks, one for each subtitle, separated by an empty line.

At the top of the block is the index.This determines the order of the subtitles.Hopefully the subtitles are already in order and all of them have unique indexes but this may not be the case.

01:04:13,000 --> 02:01:01,640 X1:167 X2:267 Y1:33 Y2:63

After the index is the start time, end time, and an optional set of points specifying the rectangle thesubtitle text should go in.

01:04:13,000

The timestamp format ishours:minutes:seconds,milliseconds.

💡 Note the comma instead of the period separating the seconds from the milliseconds.

This is the actual subtitletext. It can span multiple lines.It may include formatinglike <b>bold</b>, <i>italic</i>,<u>underline</u>,and <font color="#010101">font color</font>.

The third and last part of a block is the subtitle text.It can span multiple lines and ends when there is an empty line.The text can include formatting tags reminiscent of HTML.

Building The SRT Parser

parseSrt::ReadP [SrtSubtitle]parseSrt=  manyTill parseBlock (skipSpaces>> eof)

parseSrt is the main parser combinator that handles everything.It parses each block until it reaches the end of the file (eof) or input.To be on the safe side,there could be trailing whitespace between the last block and the end of the file.To handle this, it parses zero or more characters of whitespace (skipSpaces) before parsingthe end of the file (skipSpaces >> eof).If there is still input left by the timeeof is reached,eof will fail and this will return nothing.Therefore, it's important thatparseBlock doesn't leave any thing but whitespace behind.

Building The SRT Block Parser

parseBlock::ReadPSrtSubtitleparseBlock=do  i<- parseIndex  (s, e)<- parseTimestamps  c<- parseCoordinates  t<- parseTextLinesreturnSrtSubtitle      { index= i      , start= s      , end= e      , coordinates= c      , taggedText= t      }

As we went over earlier, a block consists of an index, timestamps, possibly some coordinates, and some lines of text.In this version ofparseBlock, you see the more imperative do notation style with the record syntax.

parseBlock'::ReadPSrtSubtitleparseBlock'=SrtSubtitle<$> parseIndex<*> parseStartTimestamp<*> parseEndTimestamp<*> parseCoordinates<*> parseTextLines

Here's another way you could writeparseBlock.This is the applicative style.Just be sure to get the order right.For example, I could've accidentally mixed up the start and end timestamps.

Building The SRT Index Parser

parseIndex::ReadPIntparseIndex=      skipSpaces>>  readInt<$> parseNumber

At the top of the block is the index.Here you seeskipSpaces again.After skipping over whitespace,it parses the input for numbers and converts it to an actual integer.

readInt::String->IntreadInt=read

readInt looks like this.

>read"123"::Int123>read"1abc"::Int***Exception:Prelude.read: no parse

Normally usingread directly can be dangerous.read may not be able to convert the input to the specified type.However,parseNumber will only return the 10 numerical digit characters (['0'..'9'])so usingread directly becomes safe.

Building The SRT Timestamps Parser

Parsing the timestamps are a little more involved than parsing the index.

parseTimestamps::ReadP (Timestamp,Timestamp)parseTimestamps=do  _<- char'\n'  s<- parseTimestamp  _<- skipSpaces  _<- string"-->"  _<- skipSpaces  e<- parseTimestampreturn (s, e)

This is the main combinator for parsing the timestamps.

char parses the character you give it or it fails.If it fails thenparseTimestamps fails, ultimately causingparseSrt to failso there must be a newline character after the index.

string is likechar except instead of just one character, itparses the string of characters you give it or it fails.

parseStartTimestamp::ReadPTimestampparseStartTimestamp=      char'\n'>>  parseTimestamp

parseTimestamps parses both timestamps,but for the applicative style (parseSrt'),we need a parser just for the start timestamp.

parseEndTimestamp::ReadPTimestampparseEndTimestamp=      skipSpaces>>  string"-->">>  skipSpaces>>  parseTimestamp

This parses everything between the timestamps and returns the end timestamp.

parseTimestamp::ReadPTimestampparseTimestamp=do  h<- parseNumber  _<- char':'  m<- parseNumber  _<- char':'  s<- parseNumber  _<- char','<|> char'.'  m'<- parseNumberreturnTimestamp      { hours= readInt h      , minutes= readInt m      , seconds= readInt s      , milliseconds= readInt m'      }

This parses the four numbers that make up the timestamp.The first three numbers are separated by a colon and the last one is separated by a comma.To be more forgiving, however, we allow the possibility of there being a period instead of a comma.

> readP_to_S (char'.'<|> char',')"..."[('.',"..")]> readP_to_S (char'.'<|> char',')",.."[(',',"..")]

💡 Note, when usingchar with<|>,only one side can succeed (twochar enter, onechar leave)sincechar consumes a single character and two characters cannot occupy the same space.

Building The SRT Coordinates Parser

The coordinates are an optional part of the block but if included, will be on the same line as the timestamps.

parseCoordinates::ReadP (MaybeSrtSubtitleCoordinates)parseCoordinates=  optionNothing$do    _<- skipSpaces1    x1<- parseCoordinate'x'1    _<- skipSpaces1    x2<- parseCoordinate'x'2    _<- skipSpaces1    y1<- parseCoordinate'y'1    _<- skipSpaces1    y2<- parseCoordinate'y'2return$JustSrtSubtitleCoordinates          { x1= readInt x1          , x2= readInt x2          , y1= readInt y1          , y2= readInt y2          }

option takes two arguments.The first argument is returned if the second argument, a parser, fails.So if the coordinates parser fails,parseCoordinates will returnNothing.Put another way, the coordinates parser failing does not cause the whole parser to fail.This block will just haveNothing for itscoordinates "field".

parseCoordinate::Char->Int->ReadPStringparseCoordinate  c  n=do  _<- char (Data.Char.toUpper c)<|> char (Data.Char.toLower c)  _<- string$show n++":"  parseNumber

This parser allows the coordinate labels to be in either uppercase or lowercase.For example,x1:1 X2:2 Y1:3 y2:4 would succeed.

Building The SRT Text Parser

Parsing the text is the most involved portion due to the HTML-like tag formatting.

Tag parsing can be challenging—just ask anyone who parses them with a regular expression.To make this easier on us—and for the user—we'll use atag soupkind of approach.The parser will allow unclosed and/or wrongly nested tags.It will also allow any tag and not justb,u,i, andfont.

parseTextLines::ReadP [TaggedText]parseTextLines=      char'\n'>>  (getTaggedText<$> manyTill parseAny parseEndOfTextLines)

We start out by matching on a newline character.After that, we functor map or fmap (<$>)getTaggedText over the subtitle text characters until we reach the end of the text lines.

parseEndOfTextLines::ReadP()parseEndOfTextLines=  void (string"\n\n")<|> eof

We stop collecting characters (parseAny) when we reach two newline characters or the end of the file.This signals the end of the block.

getTaggedText::String->  [TaggedText]getTaggedText  s=fst$foldl      folder      ([],[])      parsedwhere

getTaggedText folds through the parsed text from left to right, returning the accumulated tagged text.

    parsed::  [String]    parsed=case readP_to_S (parseTaggedText[]) sof[]-> [s]        r@(_:_)-> (fst.last) r

parsed returns a list of one or more strings.It attempts to parse the input text for tags.If that fails,parsed returns the input string inside a list.Otherwise, ifparseTaggedText succeeds,parse returns the last possible parsing ((fst . last) r).

    folder::  ([TaggedText], [Tag])->String->  ([TaggedText], [Tag])    folder      (tt, t)      x| isTag x= (tt, updateTags t x)|otherwise= (tt++ [TaggedText { text= x, tags= t}], t)

Asfolder moves from left to right, over the parsed strings, it checks if the current string is a tag.If it is a tag, it updates the current set of active tags (t).Otherwise, it appends another tagged piece of text associated with the set of active tags.

updateTags::  [Tag]->String->  [Tag]updateTags  tags  x| isClosingTag x= remove compare' tags (makeTag x)| isOpeningTag x= add    compare' tags (makeTag x)|otherwise= tagswhere    compare'::Tag->Tag->Bool    compare'      a      b=      name a/= name b

updateTags updates thetags given by either removing or adding the given tag (x) depending on if it is a closing or opening tag.If it is neither, it just returns the passed set of tags.add will overwrite an existing tag iftags already has a tag by the same name.You can see this in thecompare' function given.

To keep the parser simple, if an opening tagT is found,T gets added to the list of tagsor overwrites an exitingT if already present.If a corresponding closing/T is found, thenT is removed from the list of tags, if present.It doesn't matter if there is two or moreTs in a row,one or moreTs without a closing/T,and/or there's a closing/T without an openingT.

makeTag::String->TagmakeTag  s=Tag    { name= getTagName       s    , attributes= getTagAttributes s    }

makeTag assembles a tag from the given string (s).EachTag has a name and zero or more attributes.

parseTaggedText::  [String]->ReadP [String]parseTaggedText  strings=do  s<- lookcase sof""->return strings    _->do      r<- munch1 (/='<')<++ parseClosingTag<++ parseOpeningTag      parseTaggedText$ strings++ [r]

parseTaggedText returns the input string broken up into pieces.Each piece is either the text enclosed by tags, a closing tag, or an opening tag.After it splits off a piece, it adds it to the other pieces and calls itself again.If the remaining input string is empty, it returns the list of strings found.

> readP_to_S (string"ab"<++ string"abc")"abcd"[("ab","cd")]> readP_to_S (string"ab"+++ string"abc")"abcd"[("ab","cd"),("abc","d")]> readP_to_S (string"ab"<|> string"abc")"abcd"[("ab","cd"),("abc","d")]

The<++ operator is left biased meaning that if the left side succeeds, it won't even bother with the right.Recall that when we run the parser, we get a list of all the possible parsings.All of these possible parsings are the result of the parser having traveled through all of the possible paths.By using<++,we receive the possible parsings from the left path and from the right path if and only if the left side failed.If you'd like all of the possible parsings through the left and right side,you can use the+++ operator provided byReadP.+++ is just<|> which we saw up above.

parseOpeningTag::ReadPStringparseOpeningTag=do  _<- char'<'  t<- munch1 (\ c-> c/='/'&& c/='>')  _<- char'>'return$"<"++ t++">"

An opening tag is an opening angle bracket, some text that doesn't include a forward slash, and the next immediate closing angle bracket.

parseClosingTag::ReadPStringparseClosingTag=do  _<- char'<'  _<- char'/'  t<- munch1 (/='>')  _<- char'>'return$"</"++ t++">"

A closing tag is an opening angle bracket, a forward slash, some text, and the next immediate closing angle bracket.

getTagAttributes::String->  [TagAttribute]getTagAttributes  s=if isOpeningTag sthencase readP_to_S (parseTagAttributes[]) sof[]->[]        (x:_)->fst xelse[]

Opening tags can have attributes.For example,<font color="#101010">.Each attribute is a two-tuple, key-value pair.In the above example,color would be the key and#101010 would be the value.

getTagName::String->StringgetTagName  s=case readP_to_S parseTagName sof[]->""    (x:_)-> toLower'$fst x

This returns the tag name in lowercase.

parseTagName::ReadPStringparseTagName=do  _<- char'<'  _<- munch (=='/')  _<- skipSpaces  n<- munch1 (\ c-> c/=''&& c/='>')  _<- munch  (/='>')  _<- char'>'return n

The tag name is the first string of non-whitespace charactersafter the opening angle bracket,a possible forward slash,and some possible whitespaceand before some more whitespaceand/or the closing angle bracket.

parseTagAttributes::  [TagAttribute]->ReadP [TagAttribute]parseTagAttributes  tagAttributes=do  s<- lookcase sof""->return tagAttributes    _->dolet h=head scase hof'>'->return tagAttributes'<'-> trimTagname>> parseTagAttributes'        _-> parseTagAttributes'where    parseTagAttributes'::ReadP [TagAttribute]    parseTagAttributes'=do      tagAttribute<- parseTagAttribute      parseTagAttributes        ( add            (\ a b->fst a/=fst b)            tagAttributes            tagAttribute        )

parseTagAttributes recursively goes through the input string, collecting up the key-value pairs.At the start of the tag (<), it first trims the tag name before tackling the attributes.It stops parsing for attributes when it reaches the closing angle bracket (>).If a tag happens to have duplicate attributes (based on the key),add will ensure only the latest one remains in the list.

trimTagname::ReadP()trimTagname=      char'<'>> skipSpaces>> munch1 (\ c-> c/=''&& c/='>')>>return()

This trims or discards the tag name.

parseTagAttribute::ReadPTagAttributeparseTagAttribute=do  _<- skipSpaces  k<- munch1 (/='=')  _<- string"=\""  v<- munch1 (/='\"')  _<- char'\"'  _<- skipSpacesreturn (toLower' k, v)

The attribute key is any string of non-whitespace characters before the equal sign.The attribute value is any characters after the equal sign and double quote and before the next immediate double quote.

isTag::String->BoolisTag  s=  isOpeningTag s|| isClosingTag s

A string is a tag if it is either an opening tag or a closing tag.

isOpeningTag::String->BoolisOpeningTag  s=  isPresent$ readP_to_S parseOpeningTag s

A string is an opening tag if the opening tag parser succeeds.

isClosingTag::String->BoolisClosingTag  s=  isPresent$ readP_to_S parseClosingTag s

A string is a closing tag if the closing tag parser succeeds.

Running The SRT Parser

Now that we've assembled the parser, let's try it out.

>let srt=>" 1\n\>       \0:0:0,1 --> 0:1:0.2  x1:1 X2:3  y1:4 y2:10\n\>       \<font color=\"red\" color=\"blue\">This is some <b><u><i>\n \>       \subtitle\n\>       \</u>text.</b>"> readP_to_S parseSrt srt[([SrtSubtitle      { index=1      , start=Timestamp {hours=0, minutes=0, seconds=0, milliseconds=1}      , end=Timestamp {hours=0, minutes=1, seconds=0, milliseconds=2}      , coordinates=Just (SrtSubtitleCoordinates {x1=1, x2=3, y1=4, y2=10})      , taggedText=  [TaggedText                        { text="This is some"                        , tags= [Tag {name="font", attributes= [("color","blue")]}                                 ]                        }                      ,TaggedText                          { text="\n subtitle\n"                          , tags= [Tag {name="font", attributes= [("color","blue")]}                                   ,Tag {name="b",    attributes=[]}                                   ,Tag {name="u",    attributes=[]}                                   ,Tag {name="i",    attributes=[]}                                   ]                          }                      ,TaggedText                          { text="text."                          , tags= [Tag {name="font", attributes= [("color","blue")]}                                   ,Tag {name="b",    attributes=[]}                                   ,Tag {name="i",    attributes=[]}                                   ]                          }                      ,TaggedText                          { text=""                          , tags= [Tag {name="font", attributes= [("color","blue")]}                                   ,Tag {name="i",    attributes=[]}                                   ]                          }                      ]      }  ],"")]

Here you see the result of parsing a test string.Notice the errors in the test string like the use of a period instead of a comma or the duplicate tag attribute.

Exercises

Write a program that can convert an SRT file to a JSON file.
Rewrite the version number parser using Parsec instead of ReadP.
Rewrite the SRT parser using Parsec instead of ReadP.

Copyright

About

🔍 A step-by-step guide to parsing using Haskell parser combinators.

lettier.github.io/parsing-with-haskell-parser-combinators/

Releases

No releases published

Packages

No packages published

Languages

Haskell100.0%

Movatterモバイル変換

License

lettier/parsing-with-haskell-parser-combinators

Folders and files

Latest commit

History

Repository files navigation

Parsing With Haskell Parser Combinators

Building The Demos

Stack

Cabal

Running The Demos

Stack

Cabal

Parser Combinator

ReadP

P Data Type

Run

readP_to_S

ReadP Newtype

Applicative P Instance

Alternative P Instance

ReadP Failure

Version Number

Building The Version Number Parser

Parsing The Separator

Parsing Non-numbers

Munch Versus Many

Running The Version Number Parser

SRT

SRT Format

Building The SRT Parser

Building The SRT Block Parser

Building The SRT Index Parser

Building The SRT Timestamps Parser

Building The SRT Coordinates Parser

Building The SRT Text Parser

Running The SRT Parser

Exercises

Copyright

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages