Parsing expressions with combinators

Question 1

I'm very new to Haskell's ReadP library, and the entire concept of parser combinators, so I was wondering whether there are better ways to do some things in this program:

import Text.ParserCombinators.ReadPimport Control.Applicativeimport Data.Listdata Operator = Add | Subtractinstance Show Operator where    show Add = "+"    show Subtract = "-"instance Read Operator where    readsPrec _ "+" = [(Add, "")]    readsPrec _ "-" = [(Subtract, "")]    readsPrec _ _ = []data Expression = Number Int                | Infix { left :: Expression, op :: Operator, right :: Expression }instance Show Expression where    show (Number x) = show x    show (Infix left op right) = "(" ++ (show left) ++ " " ++ (show op) ++ " " ++ (show right) ++ ")"digit :: ReadP Chardigit = satisfy $ \char -> char >= '0' && char <= '9'number :: ReadP Expressionnumber = fmap (Number . read) (many1 digit)operator :: ReadP Operatoroperator = fmap read (string "+" <|> string "-")expression :: ReadP Expressionexpression = do    skipSpaces    left <- number    skipSpaces    op <- Control.Applicative.optional operator    case op of        Nothing -> return left        Just op -> do            skipSpaces            right <- expression            return (Infix left op right)parseExpression :: String -> Maybe ExpressionparseExpression input = case readP_to_S expression input of    [] -> Nothing    xs -> (Just . fst . last) xs

The main area where I'm looking for improvements is theexpression function, and specifically the things which seem improvable are the repeated calls toskipSpaces and the case expression to check whether an operator was parsed, but of course if you notice anything else that would be helpful too!

Question 2

First, a standard approach for dealing with theskipSpaces issue is to define a higher-order parser combinator, traditionally calledlexeme:

lexeme :: ReadP a -> ReadP alexeme p = p <* skipSpaces

Here,lexeme takes a space-naive parserp, and converts it into a new parser that parses whateverp was planning to parse, and then reads and discards any trailing spaces. You uselexeme in the definitions of any of your parsers that would reasonably be assumed to read a complete "lexeme" and ignore any trailing space. For example,number should be alexeme parser:

number :: ReadP Expressionnumber = lexeme $ fmap (Number . read) (many1 digit)

So shouldoperator, though obviously notdigit!expression won't need to uselexeme, because we'll arrange to end it with a lexeme parser.

It's also helpful to define asymbol parser which is essentiallystring that ignores trailing spaces:

symbol :: String -> ReadP Stringsymbol = lexeme . string

Consistently used,lexeme (andsymbol) will deal with all unwanted spaces, other than any leading spaces at the very start of your parse. If you have a non-recursive "top level" grammar production, like say a parser forprogram :: ReadP Program, then you would probably deal with them there. In your example, you don't have such a production (e.g.,expression is recursive), so you'd stick an extraskipSpaces inparseExpression. This is also a good place to puteof to make sure there isn't any trailing material that you aren't parsing:

parseExpression :: String -> Maybe ExpressionparseExpression input = case readP_to_S (skipSpaces *> expression <* eof) input of    [] -> Nothing    xs -> (Just . fst . last) xs

Second, your use of aRead instance for parsing your operators is very unusual. It would be more standard to make it a helper in theoperator parser, writing something like:

operator :: ReadP Operatoroperator = readSymbol <$> (symbol "+" <|> symbol "-")  where readSymbol "+" = Add        readSymbol "-" = Subtract

(though an evenmore standard version is given below).

Third, inexpression, you can avoid thecase construct by using alternation(<|>) like so:

expression' :: ReadP Expressionexpression' = do    left <- number    (do op <- operator        right <- expression        return (Infix left op right)     <|> return left)

This would be the standard approach for non-parallel parser libraries (e.g., Parsec or Megaparsec). ForReadP, it's better to replace the(<|>) operator with the ReadP-specific(<++) operator to avoid also following the unwanted second parse in parallel. Beware that(<++) has higher precedence than(<|>), so some extra parentheses might be needed if it's being used in combination with other operators, as in the examples below.

Fourth, you've probably noticed my use of the applicative operators<* and*> and the alias<$> forfmap in the code above. It isvery common to use these -- plus the additional applicative operator<*> and sometimes the operators<**> or<$ -- in parsers. Once you get used to them, they tend to lead to less cluttered code.

For example, a more standard way of writingexpression would be:

expression' :: ReadP Expressionexpression' =     Infix <$> number <*> operator <*> expression              <|> number

or the slightly more efficient solution:

expression :: ReadP Expressionexpression = do  left <- number  (Infix left <$> operator <*> expression) <++ return left

Note that, in the context of parsers, an expression likef <$> p <*> q means "try to run the parserp, and then the parserq; assuming they both succeed, pass their return values tof". In other words, thatInfix expression is essentially:

Infix left op right

whereop is the return value from the parseroperator andright is the return value from the parserexpression.

Similarly, the standard way of writingoperator is actually:

operator :: ReadP Operatoroperator = Add <$ symbol "+" <|> Subtract <$ symbol "-"

This one requires an additional word of explanation. The operator<$ is kind of an odd duck. It's type signature is:

(<$) :: a -> f b -> f a

but in the context of parsers specifically, the meaning ofx <$ p is "try to run the parserp; if it succeeds, ignore its return value and returnx". Basically, it's used to replace the return value of a parser that's used only for its success or failure and not its return value.

Note that these versions ofexpression, like your original version, treat the operators as right associative. This may be a problem if you're trying to parse "1-2-3" as equivalent to "(1-2)-3" instead of "1-(2-3)".

A few additional minor points:

isDigit is a more readable name for\c -> c >= '0' && c <= '9'
munch1 is more efficient thanmany1 (satisfy xxx), so I'd redefinenumber to use it
for testing, it's probably a good idea to have aparseExpressions function that looks atall the parses
for production, it's probably a good idea to check for ambiguous parses and do something about it, rather than (fairly arbitrarily) selecting the last parse in the list

With all of these suggestions implemented, the final version would look something like:

{-# OPTIONS_GHC -Wall #-}import Data.Char (isDigit)import Control.Applicativeimport Text.ParserCombinators.ReadP (eof, munch1, ReadP, readP_to_S,                                     skipSpaces, string, (<++))data Operator = Add | Subtractdata Expression = Number Int                | Infix { left :: Expression, op :: Operator, right :: Expression }lexeme :: ReadP a -> ReadP alexeme p = p <* skipSpacessymbol :: String -> ReadP Stringsymbol = lexeme . stringnumber :: ReadP Expressionnumber = lexeme $ Number . read <$> munch1 isDigitoperator :: ReadP Operatoroperator = Add <$ symbol "+" <|> Subtract <$ symbol "-"expression :: ReadP Expressionexpression = do  x <- number  (Infix x <$> operator <*> expression) <++ return xtop :: ReadP Expressiontop = skipSpaces *> expression <* eofparseExpressions :: String -> [(Expression, String)]parseExpressions = readP_to_S topparseExpression :: String -> Maybe ExpressionparseExpression input = case parseExpressions input of    [] -> Nothing    [(x,"")] -> Just x    _ -> error "ambiguous parse"

K. A. Buhr 1,0065 silver badges8 bronze badges · Accepted Answer · 2019-04-26 01:06:07Z

First, a standard approach for dealing with theskipSpaces issue is to define a higher-order parser combinator, traditionally calledlexeme:

lexeme :: ReadP a -> ReadP alexeme p = p <* skipSpaces

Here,lexeme takes a space-naive parserp, and converts it into a new parser that parses whateverp was planning to parse, and then reads and discards any trailing spaces. You uselexeme in the definitions of any of your parsers that would reasonably be assumed to read a complete "lexeme" and ignore any trailing space. For example,number should be alexeme parser:

number :: ReadP Expressionnumber = lexeme $ fmap (Number . read) (many1 digit)

So shouldoperator, though obviously notdigit!expression won't need to uselexeme, because we'll arrange to end it with a lexeme parser.

It's also helpful to define asymbol parser which is essentiallystring that ignores trailing spaces:

symbol :: String -> ReadP Stringsymbol = lexeme . string

Consistently used,lexeme (andsymbol) will deal with all unwanted spaces, other than any leading spaces at the very start of your parse. If you have a non-recursive "top level" grammar production, like say a parser forprogram :: ReadP Program, then you would probably deal with them there. In your example, you don't have such a production (e.g.,expression is recursive), so you'd stick an extraskipSpaces inparseExpression. This is also a good place to puteof to make sure there isn't any trailing material that you aren't parsing:

parseExpression :: String -> Maybe ExpressionparseExpression input = case readP_to_S (skipSpaces *> expression <* eof) input of    [] -> Nothing    xs -> (Just . fst . last) xs

Second, your use of aRead instance for parsing your operators is very unusual. It would be more standard to make it a helper in theoperator parser, writing something like:

operator :: ReadP Operatoroperator = readSymbol <$> (symbol "+" <|> symbol "-")  where readSymbol "+" = Add        readSymbol "-" = Subtract

(though an evenmore standard version is given below).

Third, inexpression, you can avoid thecase construct by using alternation(<|>) like so:

expression' :: ReadP Expressionexpression' = do    left <- number    (do op <- operator        right <- expression        return (Infix left op right)     <|> return left)

This would be the standard approach for non-parallel parser libraries (e.g., Parsec or Megaparsec). ForReadP, it's better to replace the(<|>) operator with the ReadP-specific(<++) operator to avoid also following the unwanted second parse in parallel. Beware that(<++) has higher precedence than(<|>), so some extra parentheses might be needed if it's being used in combination with other operators, as in the examples below.

Fourth, you've probably noticed my use of the applicative operators<* and*> and the alias<$> forfmap in the code above. It isvery common to use these -- plus the additional applicative operator<*> and sometimes the operators<**> or<$ -- in parsers. Once you get used to them, they tend to lead to less cluttered code.

For example, a more standard way of writingexpression would be:

expression' :: ReadP Expressionexpression' =     Infix <$> number <*> operator <*> expression              <|> number

or the slightly more efficient solution:

expression :: ReadP Expressionexpression = do  left <- number  (Infix left <$> operator <*> expression) <++ return left

Note that, in the context of parsers, an expression likef <$> p <*> q means "try to run the parserp, and then the parserq; assuming they both succeed, pass their return values tof". In other words, thatInfix expression is essentially:

Infix left op right

whereop is the return value from the parseroperator andright is the return value from the parserexpression.

Similarly, the standard way of writingoperator is actually:

operator :: ReadP Operatoroperator = Add <$ symbol "+" <|> Subtract <$ symbol "-"

This one requires an additional word of explanation. The operator<$ is kind of an odd duck. It's type signature is:

(<$) :: a -> f b -> f a

but in the context of parsers specifically, the meaning ofx <$ p is "try to run the parserp; if it succeeds, ignore its return value and returnx". Basically, it's used to replace the return value of a parser that's used only for its success or failure and not its return value.

Note that these versions ofexpression, like your original version, treat the operators as right associative. This may be a problem if you're trying to parse "1-2-3" as equivalent to "(1-2)-3" instead of "1-(2-3)".

A few additional minor points:

isDigit is a more readable name for\c -> c >= '0' && c <= '9'
munch1 is more efficient thanmany1 (satisfy xxx), so I'd redefinenumber to use it
for testing, it's probably a good idea to have aparseExpressions function that looks atall the parses
for production, it's probably a good idea to check for ambiguous parses and do something about it, rather than (fairly arbitrarily) selecting the last parse in the list

With all of these suggestions implemented, the final version would look something like:

{-# OPTIONS_GHC -Wall #-}import Data.Char (isDigit)import Control.Applicativeimport Text.ParserCombinators.ReadP (eof, munch1, ReadP, readP_to_S,                                     skipSpaces, string, (<++))data Operator = Add | Subtractdata Expression = Number Int                | Infix { left :: Expression, op :: Operator, right :: Expression }lexeme :: ReadP a -> ReadP alexeme p = p <* skipSpacessymbol :: String -> ReadP Stringsymbol = lexeme . stringnumber :: ReadP Expressionnumber = lexeme $ Number . read <$> munch1 isDigitoperator :: ReadP Operatoroperator = Add <$ symbol "+" <|> Subtract <$ symbol "-"expression :: ReadP Expressionexpression = do  x <- number  (Infix x <$> operator <*> expression) <++ return xtop :: ReadP Expressiontop = skipSpaces *> expression <* eofparseExpressions :: String -> [(Expression, String)]parseExpressions = readP_to_S topparseExpression :: String -> Maybe ExpressionparseExpression input = case parseExpressions input of    [] -> Nothing    [(x,"")] -> Just x    _ -> error "ambiguous parse"

Movatterモバイル変換

Stack Exchange Network

Parsing expressions with combinators

1 Answer1

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS