I've written a simple recursive descent JSON parser in Rust. It just reads in a JSON file and prints the syntax tree to the terminal. I'm still learning Rust, and I'd appreciate any review/feedback. Particularly in regard to error handling and idiomatic language, but there could still be bugs lurking around. I've omitted tests for brevity.
lexer.rs
```rust
use core::fmt;

use anyhow::{bail, Result};

#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    LeftBrace,
    RightBrace,
    LeftBracket,
    RightBracket,
    Colon,
    Comma,
    Str(&'a str),
    Number(Numeric),
    True,
    False,
    Null,
}

impl fmt::Display for Token<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::LeftBrace => write!(f, "{{"),
            Self::RightBrace => write!(f, "}}"),
            Self::LeftBracket => write!(f, "["),
            Self::RightBracket => write!(f, "]"),
            Self::Colon => write!(f, ":"),
            Self::Comma => write!(f, ","),
            Self::Str(s) => write!(f, "\"{s}\""),
            Self::Number(Numeric::Integer(n)) => write!(f, "{n}"),
            Self::Number(Numeric::Float(n)) => write!(f, "{n}"),
            Self::True => write!(f, "true"),
            Self::False => write!(f, "false"),
            Self::Null => write!(f, "null"),
        }
    }
}

#[derive(Debug, PartialEq, Copy, Clone)]
pub enum Numeric {
    Float(f64),
    Integer(i64),
}

pub struct Lexer<'a> {
    json: &'a str,
}

impl<'a> Lexer<'a> {
    pub fn new(json: &'a str) -> Self {
        Self { json }
    }

    pub fn lex(&mut self) -> Result<Vec<Token>> {
        use Token::*;
        let mut tokens: Vec<Token> = vec![];
        loop {
            let mut chars = self.json.chars();
            match chars.next() {
                Some(c) => {
                    match c {
                        '{' => {
                            tokens.push(LeftBrace);
                            self.json = chars.as_str();
                        }
                        '}' => {
                            tokens.push(RightBrace);
                            self.json = chars.as_str();
                        }
                        '[' => {
                            tokens.push(LeftBracket);
                            self.json = chars.as_str();
                        }
                        ']' => {
                            tokens.push(RightBracket);
                            self.json = chars.as_str();
                        }
                        ':' => {
                            tokens.push(Colon);
                            self.json = chars.as_str();
                        }
                        ',' => {
                            tokens.push(Comma);
                            self.json = chars.as_str();
                        }
                        '"' => {
                            let s = self.lex_string()?;
                            tokens.push(Str(s))
                        }
                        '0'..='9' | '-' => {
                            let num = self.lex_number()?;
                            tokens.push(Number(num));
                        }
                        't' if self.json.starts_with("true") => {
                            tokens.push(True);
                            self.json = &self.json["true".len()..];
                        }
                        'f' if self.json.starts_with("false") => {
                            tokens.push(False);
                            self.json = &self.json["false".len()..];
                        }
                        'n' if self.json.starts_with("null") => {
                            tokens.push(Null);
                            self.json = &self.json["null".len()..];
                        }
                        ' ' | '\t' | '\r' | '\n' => {
                            self.json = chars.as_str();
                        }
                        c => bail!("Lexing failure on character {c}"),
                    }
                }
                None => {
                    break;
                }
            }
        }
        Ok(tokens)
    }

    fn lex_string(&mut self) -> Result<&'a str> {
        let mut charindices = self.json.char_indices();
        let _ = charindices.next(); // skip first '"'
        let mut escape = false;
        let n;
        loop {
            match charindices.next() {
                None => bail!("Unexpected end of input in string literal"),
                Some((i, '"')) if !escape => {
                    n = i;
                    break;
                }
                Some((_, '\\')) => {
                    escape = true;
                }
                Some(_) => {
                    escape = false;
                }
            }
        }
        let result = &self.json[1..n];
        self.json = charindices.as_str();
        Ok(result)
    }

    fn lex_number(&mut self) -> Result<Numeric> {
        let mut charindices = self.json.char_indices().peekable();
        let mut n = 0;
        let mut floating_point = false;
        if matches!(charindices.peek(), Some((_, '-'))) {
            charindices.next();
        }
        if matches!(charindices.peek(), Some((_, '.')) | Some((_, 'e')) | Some((_, 'E'))) {
            bail!("Invalid numeric literal");
        }
        if matches!(charindices.peek(), Some((_, '0'))) {
            charindices.next().unwrap(); // safe from if condition
            if let Some((_, c)) = charindices.peek() {
                match c {
                    'e' | 'E' | '.' | ' ' | '\t' | '\r' | '\n' | ',' | ']' | '}' => {}
                    _ => bail!("Invalid numeric literal"),
                }
            }
        }
        // read integer
        loop {
            match charindices.peek() {
                None => {
                    n = self.json.as_bytes().len();
                    break;
                }
                Some((i, c)) => match c {
                    '0'..='9' => {
                        charindices.next();
                    }
                    '.' | 'E' | 'e' => break,
                    ' ' | '\t' | '\r' | '\n' | ',' | ']' | '}' => {
                        n = *i;
                        break;
                    }
                    _ => bail!("Invalid numeric literal"),
                },
            }
        }
        // read fraction
        if matches!(charindices.peek(), Some((_, '.'))) {
            floating_point = true;
            charindices.next();
            loop {
                match charindices.peek() {
                    None => {
                        n = self.json.as_bytes().len();
                        break;
                    }
                    Some((i, c)) => match c {
                        '0'..='9' => {
                            charindices.next();
                        }
                        'e' | 'E' => break,
                        ' ' | '\t' | '\r' | '\n' | ',' | ']' | '}' => {
                            n = *i;
                            break;
                        }
                        _ => bail!("Invalid numeric literal"),
                    },
                }
            }
        }
        // read exponent
        if matches!(charindices.peek(), Some((_, 'e')) | Some((_, 'E'))) {
            floating_point = true;
            charindices.next();
            if matches!(charindices.peek(), Some((_, '+')) | Some((_, '-'))) {
                charindices.next();
            }
            loop {
                match charindices.peek() {
                    None => {
                        n = self.json.as_bytes().len();
                        break;
                    }
                    Some((i, c)) => match c {
                        '0'..='9' => {
                            charindices.next();
                        }
                        ' ' | '\t' | '\r' | '\n' | ',' | ']' | '}' => {
                            n = *i;
                            break;
                        }
                        _ => bail!("Invalid numeric literal"),
                    },
                }
            }
        }
        let result = if floating_point {
            let res = self.json[0..n].parse::<f64>()?;
            Ok(Numeric::Float(res))
        } else {
            let res = self.json[0..n].parse::<i64>()?;
            Ok(Numeric::Integer(res))
        };
        self.json = &self.json[n..];
        result
    }
}
```

parser.rs
```rust
use std::collections::HashMap;
use std::iter::Peekable;
use std::slice::Iter;

use anyhow::{bail, Result};

use crate::lexer::Token;

#[derive(Debug)]
pub enum JsonTree<'a> {
    Object(HashMap<&'a str, JsonTree<'a>>),
    Array(Vec<JsonTree<'a>>),
    Atom(Token<'a>),
}

pub struct Parser {}

impl<'a> Parser {
    pub fn new() -> Self {
        Self {}
    }

    pub fn parse(&'a self, vec: &'a [Token]) -> Result<Option<JsonTree>> {
        self.value(&mut vec.iter().peekable())
    }

    fn value(&'a self, iter: &mut Peekable<Iter<Token<'a>>>) -> Result<Option<JsonTree>> {
        match iter.peek() {
            Some(Token::LeftBrace) => match self.object(iter) {
                Ok(object) => Ok(Some(object)),
                Err(e) => Err(e.context("In parsing of object")),
            },
            Some(Token::LeftBracket) => match self.array(iter) {
                Ok(array) => Ok(Some(array)),
                Err(e) => Err(e.context("In parsing of array")),
            },
            Some(Token::Str(s)) => {
                iter.next();
                Ok(Some(JsonTree::Atom(Token::Str(s))))
            }
            Some(Token::Number(n)) => {
                iter.next();
                Ok(Some(JsonTree::Atom(Token::Number(*n))))
            }
            Some(Token::True) => {
                iter.next();
                Ok(Some(JsonTree::Atom(Token::True)))
            }
            Some(Token::False) => {
                iter.next();
                Ok(Some(JsonTree::Atom(Token::False)))
            }
            Some(Token::Null) => {
                iter.next();
                Ok(Some(JsonTree::Atom(Token::Null)))
            }
            None => Ok(None),
            Some(t) => bail!("Unexpected token {t:?}"),
        }
    }

    fn object(&'a self, iter: &mut Peekable<Iter<Token<'a>>>) -> Result<JsonTree> {
        iter.next(); // consume left brace
        let mut elements: HashMap<&str, JsonTree> = HashMap::new();
        if matches!(iter.peek(), Some(Token::RightBrace)) {
            return Ok(JsonTree::Object(elements));
        }
        loop {
            let name = match iter.next() {
                Some(Token::Str(s)) => s,
                t => bail!("Invalid token {t:?}"),
            };
            match iter.next() {
                Some(Token::Colon) => {}
                Some(t) => bail!("Expected ':' but found {:?}", t),
                None => bail!("Unexpected end of input"),
            }
            let member = match self.value(iter) {
                Ok(Some(value)) => value,
                Ok(None) => bail!("Unexpected end of input"),
                Err(e) => return Err(e.context("In object member definition")),
            };
            elements.insert(name, member);
            if matches!(iter.peek(), Some(Token::Comma)) {
                iter.next();
            } else {
                break;
            }
        }
        let token = iter.next();
        if !matches!(token, Some(Token::RightBrace)) {
            bail!("Expected }}, got {token:?}");
        }
        Ok(JsonTree::Object(elements))
    }

    fn array(&'a self, iter: &mut Peekable<Iter<Token<'a>>>) -> Result<JsonTree> {
        iter.next();
        let mut elements = Vec::new();
        if matches!(iter.peek(), Some(Token::RightBracket)) {
            iter.next();
            return Ok(JsonTree::Array(elements));
        }
        loop {
            let element = match self.value(iter) {
                Ok(Some(value)) => value,
                Ok(None) => bail!("Unexpected end of input"),
                Err(e) => return Err(e.context("In array definition")),
            };
            elements.push(element);
            if matches!(iter.peek(), Some(Token::Comma)) {
                iter.next();
            } else {
                break;
            }
        }
        let token = iter.next();
        if !matches!(token, Some(Token::RightBracket)) {
            bail!("Expected ] but got {token:?}");
        }
        Ok(JsonTree::Array(elements))
    }
}
```

main.rs
```rust
use std::io::Read;
use std::path::PathBuf;
use std::process::exit;
use std::fs::File;

use anyhow::Result;
use clap::Parser;

use lexer::Lexer;
use parser::Parser as MyParser;

mod lexer;
mod parser;

fn main() -> Result<()> {
    let args = Args::parse();
    let mut file = match File::open(&args.path) {
        Ok(f) => f,
        Err(e) => {
            println!("Encountered error opening file {}: {e}", args.path.to_str().unwrap());
            exit(-1);
        }
    };
    let mut contents = String::new();
    match file.read_to_string(&mut contents) {
        Ok(_) => {}
        Err(e) => {
            println!("Encountered error reading file {}: {e}", args.path.to_str().unwrap());
            exit(-1);
        }
    }
    let mut lexer = Lexer::new(&contents);
    let tokens = lexer.lex()?;
    let parser = MyParser::new();
    match parser.parse(&tokens) {
        Ok(Some(tree)) => {
            println!("{tree:?}")
        }
        Ok(None) => {}
        Err(e) => {
            println!("Error parsing {}: {e}", args.path.to_str().unwrap());
            exit(1);
        }
    }
    Ok(())
}

// Simple JSON parser
#[derive(Parser, Debug)]
struct Args {
    // The input file path
    path: PathBuf,
}
```

> Your implementation appears incomplete. E.g. "NaN" cannot be parsed while it should be valid. – user51621, Nov 8, 2024 at 9:23
1 Answer
Main
Error Handling
Your error handling is really not great:
- Errors should be printed to stderr, not stdout.
- Pattern matching + `println!` + `exit` is very verbose.
You're really not using the strengths of `anyhow` here!
Instead, consider using:
- `.with_context` to attach context.
- `?` to propagate the error upwards.
As in:
```rust
let mut file = File::open(&args.path)
    .with_context(|| format!("Failed to open '{}'", args.path.to_str().unwrap()))?;
```

Much more concise!
Factorization
Consider factorizing `args.path.to_str().unwrap()` out into a variable, or at least providing a simple getter on `Args` that does the job.
Non UTF-8 Paths
The reason that `unwrap` is necessary in `args.path.to_str().unwrap()` is that not all paths are valid UTF-8 (filesystems don't care). If the unwrap panics, then you'll get the message and backtrace of the unwrap, and lose the error message you were trying to print... which is sad.
Instead, since this is for display purposes, consider using `.to_string_lossy()`.
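Putting the two suggestions together, a minimal sketch (the `display_path` name and the pared-down `Args` are mine, not from the original code, and the clap derive is omitted):

```rust
use std::borrow::Cow;
use std::path::PathBuf;

// Hypothetical stand-in for the Args struct from main.rs.
struct Args {
    path: PathBuf,
}

impl Args {
    // A display-oriented getter: never panics, even on non-UTF-8 paths.
    fn display_path(&self) -> Cow<'_, str> {
        self.path.to_string_lossy()
    }
}

fn main() {
    let args = Args { path: PathBuf::from("data/input.json") };
    println!("Reading {}", args.display_path());
}
```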
Lexer
Memory Allocation!
Why lex into a `Vec`?
You should be able to parse JSON with O(1) lookahead, which means you shouldn't need a `Vec`. Instead, your `Lexer` could implement the `Iterator` trait, and produce tokens one at a time, until it can't.
(Or an error, of course, which would also be the last thing it produces)
This is especially vexing as the parser immediately turns the `Vec` into an iterator...
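A sketch of what an iterator-shaped lexer could look like, reusing the post's `Lexer` and `Token` names but with plain `String` errors (instead of `anyhow`) so it stands alone, and with only a couple of arms shown:

```rust
struct Lexer<'a> {
    json: &'a str,
}

#[derive(Debug, PartialEq)]
enum Token<'a> {
    LeftBrace,
    RightBrace,
    Str(&'a str),
    // ... remaining variants as in the post
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        // Skip leading whitespace; end the stream at end of input.
        self.json = self.json.trim_start();
        let c = self.json.chars().next()?;
        match c {
            '{' => {
                self.json = &self.json[1..];
                Some(Ok(Token::LeftBrace))
            }
            '}' => {
                self.json = &self.json[1..];
                Some(Ok(Token::RightBrace))
            }
            // ... the other arms from the original `lex` loop go here
            _ => Some(Err(format!("Lexing failure on character {c}"))),
        }
    }
}

fn main() {
    let mut lexer = Lexer { json: " { } " };
    assert_eq!(lexer.next(), Some(Ok(Token::LeftBrace)));
    assert_eq!(lexer.next(), Some(Ok(Token::RightBrace)));
    assert_eq!(lexer.next(), None);
    println!("ok");
}
```

An error is produced as a regular item; by convention the caller stops pulling after seeing one.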
Bytes! It's just Bytes!
First of all, it would be cool if the lexer could consume bytes without first validating at the UTF-8 level.
But more importantly, parsing via `.chars()` costs you quite a bit of performance:

- `.chars()` has to reconstitute a `char` from a stream of bytes.
- You then match on a 32-bit `char` instead of an 8-bit byte.
I would encourage you to use `.as_bytes()` instead, and parse byte by byte. All the delimiters you need to match are single-byte, after all.
And you can still sub-slice on the base `&str` when producing an `&str`, to avoid having to re-validate that it's UTF-8 all the way through.
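A minimal sketch of the byte-level dispatch (the `first_token_byte` helper is hypothetical, purely to illustrate the idea). Since UTF-8 guarantees that every byte of a multi-byte character is >= 0x80, matching ASCII delimiter bytes can never split a character:

```rust
// Classify the first byte of the remaining input without decoding chars.
fn first_token_byte(json: &str) -> Option<&'static str> {
    match json.as_bytes().first()? {
        b'{' => Some("left brace"),
        b'[' => Some("left bracket"),
        b'"' => Some("string start"),
        b'0'..=b'9' | b'-' => Some("number start"),
        _ => None,
    }
}

fn main() {
    assert_eq!(first_token_byte("{\"k\":1}"), Some("left brace"));
    assert_eq!(first_token_byte("-12"), Some("number start"));
    assert_eq!(first_token_byte("x"), None);
    println!("ok");
}
```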
Built-in search!
`&str` has a built-in search function which is likely better optimized than your `char`-by-`char` loop.
It's not very well-documented, unfortunately, but you can search by an array (or slice) of `char` and get the index of the first matching byte in the string, that is:
```rust
let mut rest = &self.json[1..];
while let Some(n) = rest.find(['\\', '"']) {
    let byte = rest.as_bytes()[n];
    assert!(byte == b'\\' || byte == b'"');
    // ... handle the match, then advance past it so the loop terminates:
    rest = &rest[n + 1..];
}
```

Hopefully, behind the scenes it uses an accelerated vectorized search. Otherwise, consider the `memchr` crate, and specifically its `memchr2` function to search for the end of the string.
Redundant number parsing
You are *very* carefully evaluating each number `char` by `char`, only to then call the standard library parsing routine.

The standard library parsing routine *will* parse the number, and validate each `char` in turn, so all your work is quite redundant really.
Instead, you only really need to figure out where the end of the number is. This is the first byte that cannot appear in a number: neither an ASCII digit, nor `.`, an exponent marker (`e`/`E`), or a sign.
Unfortunately `memchr` cannot help here because there are too many terminating characters (space, tab, carriage return, line feed, `}`, `]`, `,`) so a simple `str::find` by predicate (not a byte that can appear in a number) is the easiest, if not necessarily the fastest.
And once you have the range, it's a simple matter of figuring out whether there's a `.` (or exponent marker) in there, which `memchr` can help with, to know whether it's a float or an integral.
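For illustration, a sketch of the end-of-number search as a single predicate scan (the `number_end` helper is mine; note the predicate must admit digits, `.`, exponent markers, and signs):

```rust
// Find the byte length of the number at the start of `json` by locating
// the first character that cannot be part of a JSON number.
fn number_end(json: &str) -> usize {
    json.find(|c: char| !matches!(c, '0'..='9' | '.' | 'e' | 'E' | '+' | '-'))
        .unwrap_or(json.len())
}

fn main() {
    assert_eq!(number_end("123, "), 3);
    assert_eq!(number_end("-1.5e10]"), 7);
    assert_eq!(number_end("42"), 2);
    println!("ok");
}
```

This accepts some spans the lexer would reject (e.g. `1-2`), but `parse` will catch those anyway, which is the whole point of not duplicating its validation.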
Float or Integral
An interesting conundrum with relying on "flags" of floats to determine whether to parse as float (or not) is that a number can be an integral, yet not fit in an `i64` (or `u128` for that matter).

One technique to deal with such overflows would be to attempt to parse a would-be integer as an integral, but fall back to `f64` on overflow.
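A sketch of that fallback, assuming the lexer has already validated the span (the `parse_numeric` helper name is mine, `Numeric` is the enum from lexer.rs):

```rust
#[derive(Debug, PartialEq, Copy, Clone)]
enum Numeric {
    Float(f64),
    Integer(i64),
}

// Try i64 first; if the (already validated) literal overflows, fall back
// to f64, trading exactness for range.
fn parse_numeric(text: &str) -> Result<Numeric, std::num::ParseFloatError> {
    match text.parse::<i64>() {
        Ok(n) => Ok(Numeric::Integer(n)),
        Err(_) => text.parse::<f64>().map(Numeric::Float),
    }
}

fn main() {
    assert_eq!(parse_numeric("123"), Ok(Numeric::Integer(123)));
    // Too big for i64, but representable (approximately) as f64:
    assert!(matches!(
        parse_numeric("99999999999999999999"),
        Ok(Numeric::Float(_))
    ));
    println!("ok");
}
```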
Parser
Tighten up the model!
That `Atom(Token<'_>)` may feel slick, but it's really unfortunate.

Imagine the poor consumer of this API: they'll need to match on `LeftBrace`, `RightBrace`, etc... all the time even though those should never be there!

I would advise just accepting your fate, and either creating a new `Atom` enum, or just enumerating all cases in `JsonTree`.
A tight type is a type which can only represent valid values!
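For example, enumerating all cases directly in `JsonTree` could look like this (the variant names are my choice):

```rust
use std::collections::HashMap;

// A tighter model: only valid JSON values are representable, so consumers
// never have to match on punctuation tokens.
#[derive(Debug, PartialEq)]
enum JsonTree<'a> {
    Object(HashMap<&'a str, JsonTree<'a>>),
    Array(Vec<JsonTree<'a>>),
    Str(&'a str),
    Float(f64),
    Integer(i64),
    Bool(bool),
    Null,
}

fn main() {
    let tree = JsonTree::Array(vec![JsonTree::Integer(1), JsonTree::Null]);
    assert_eq!(
        tree,
        JsonTree::Array(vec![JsonTree::Integer(1), JsonTree::Null])
    );
    println!("ok");
}
```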
Equality is more readable than `matches!`
Consider:

```rust
// Instead of:
if matches!(iter.peek(), Some(Token::RightBrace)) { ... }
if !matches!(iter.peek(), Some(Token::RightBrace)) { ... }

// write:
if iter.peek() == Some(&&Token::RightBrace) { ... }
if iter.peek() != Some(&&Token::RightBrace) { ... }
```

(The `&&` is because `peek` on an iterator of references yields `Option<&&Token>`.) The only thing you need to be able to write the equality is `PartialEq` on `Token`, which it already derives. Note that you cannot also derive `Eq` here, since `Numeric` contains an `f64`, which is only `PartialEq`.
Parsing Stack Overflow
It's not clear whether you'll care, but it's worth touching upon in general when talking about parsers.
In the case of a very deeply nested JSON, unbounded recursion in the code may lead to a Stack Overflow, terminating the thread. You *could* argue it's fine, and that the user can just use a thread with a sufficiently large stack.
Or you could change how the parser works to use a *manual* stack instead. At the very least, it's a fairly interesting exercise.
Model Stack Overflow
And of course, if you have a very deeply nested JSON, you end up with a very deeply nested `JsonTree`, whose `Drop` implementation may overflow.

To avoid the issue, you would need to write a custom `Drop` implementation for `JsonTree`, which *doesn't* recurse.
Hint 1: The simplest implementation is to just use a stack of `JsonTree`.

Hint 2: Create a `Vec`, push all the elements/fields (if any) of `self` into it. Then pop from the `Vec`, again push all the elements/fields (if any) into it. Go back to popping.
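Putting those hints together, a self-contained sketch with a simplified `JsonTree` (the `Atom` payload is dropped so the example compiles on its own; `deep_tree` is a test helper of mine):

```rust
use std::collections::HashMap;

// Simplified stand-in for the JsonTree in the post.
enum JsonTree {
    Object(HashMap<String, JsonTree>),
    Array(Vec<JsonTree>),
    Atom,
}

impl Drop for JsonTree {
    fn drop(&mut self) {
        // Move all children into a manual stack, so no destructor call
        // ever recurses more than one level deep.
        let mut stack = Vec::new();
        match self {
            JsonTree::Object(map) => stack.extend(std::mem::take(map).into_values()),
            JsonTree::Array(vec) => stack.extend(std::mem::take(vec)),
            JsonTree::Atom => {}
        }
        while let Some(mut tree) = stack.pop() {
            match &mut tree {
                JsonTree::Object(map) => stack.extend(std::mem::take(map).into_values()),
                JsonTree::Array(vec) => stack.extend(std::mem::take(vec)),
                JsonTree::Atom => {}
            }
            // `tree` is dropped here with its children already removed,
            // so its own `drop` only sees an empty node.
        }
    }
}

// Build an array nested `depth` levels deep.
fn deep_tree(depth: usize) -> JsonTree {
    let mut tree = JsonTree::Atom;
    for _ in 0..depth {
        tree = JsonTree::Array(vec![tree]);
    }
    tree
}

fn main() {
    // With the default recursive drop glue, this depth would overflow the
    // stack; the manual stack handles it iteratively.
    drop(deep_tree(100_000));
    println!("dropped deeply nested tree without overflow");
}
```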