S-expression parser in Rust

Question 1

I wrote a toy s-expression parser, and I'd like to know if I can make it more Rusty.I'm not terribly worried about the functionality. It's only a Rust exercise for me.

enum Token {    Number(u64),    Identifier(String),    ParOpen,    ParClose,}fn tokenize (source: &str) -> Result<Vec<Token>, String> {    let mut tokens: Vec<Token> = Vec::new();    let regex = Regex::new(r"[0-9][-+*?!_a-zA-Z0-9]*|[-+*?!_a-zA-Z][-+*?!_a-zA-Z0-9]*|\(|\)").unwrap();    for candidate_match in regex.find_iter(source) {        let candidate_string = candidate_match.as_str();        match candidate_string.chars().next().unwrap() {            '-' | '+' | '*' | '?' | '!' | '_' | 'a'..='z' | 'A'..='Z' => tokens.push(Token::Identifier(candidate_string.to_owned())),            '0'..='9' => {                match candidate_string.parse::<u64>() {                    Ok(n) => tokens.push(Token::Number(n)),                    Err(_) => return Err(format!("Invalid number: {}", candidate_string)),                }            },            '(' => tokens.push(Token::ParOpen),            ')' => tokens.push(Token::ParClose),            _ => (),        }    }    Ok(tokens)}enum Node {    Number(u64),    Identifier(String),    List(Vec<Node>),}fn parse (tokens: &[Token]) -> Result<Node, String> {    let mut index: usize = 0;    let mut stack: Vec<Vec<Node>> = vec![Vec::new()];    loop {        match &tokens[index] {            Token::Number(value) => {                stack.last_mut().unwrap().push(Node::Number(value.clone()));            },            Token::Identifier(name) => {                stack.last_mut().unwrap().push(Node::Identifier(name.clone()));            }            Token::ParOpen => {                stack.push(Vec::new());            },            Token::ParClose => {                if stack.len() <= 1 {                    return Err("Unexpected ')'".to_owned());                }                let list = Node::List(stack.pop().unwrap());                stack.last_mut().unwrap().push(list);            },        }        if index >= tokens.len() - 1 {            if stack.len() > 1 {                return Err("Expected ')'".to_owned());            }            return Ok(Node::List(stack.pop().unwrap()));        }        index += 1;    }}fn stringify (node: &Node) -> String {    fn stringify (node: &Node) -> String {        match node {            Node::Number(value) => value.to_string(),            Node::Identifier(name) => name.clone(),            Node::List(entries) => format!("({})", entries.iter().map(stringify).collect::<Vec<String>>().join(" ")),        }    }    match node {        Node::Number(value) => value.to_string(),        Node::Identifier(name) => name.clone(),        Node::List(entries) => entries.iter().map(stringify).collect::<Vec<String>>().join(" "),    }}

https://godbolt.org/z/fP6fd41Tr

I am not concerned whether this is a good s-expression parser or not. This is just Rust exercise for me, not a parsing one. I am interested in Rust-related feedback.

I'm a bit new to the language, and my code looks verbose to me. Perhaps I can simplify this? Any of it? Am I copying data around without good reason? Is there a nice way to compile that regex once and store it somewhere?

Question 2

Your questions

Is there a nice way to compile that regex once and store it somewhere?

Yep,OnceLock:

use std::sync::OnceLock;static REGEX: OnceLock<Regex> = OnceLock::new();fn tokenize (source: &str) -> Result<Vec<Token>, String> {    let mut tokens: Vec<Token> = Vec::new();    let regex = REGEX.get_or_init(        || Regex::new(r"[0-9][-+*?!_a-zA-Z0-9]*|[-+*?!_a-zA-Z][-+*?!_a-zA-Z0-9]*|\(|\)").unwrap()    );    // ...    Ok(vec![])}

Parsing approach

You chose regex as a way to find the relevant symbols. It may fit your design well, we don't know the context, but it's quite unusual for a parser to simply discard anything it does not understand.

For example, you will parsea & b as two identifiersa andb, ignoring the ampersand altogether. Again, maybe that's what you really want, but it would be confusing for me.

The more standard approach to manual (without defining a grammar) parsing would be to read characters one by one, rejecting invalid ones immediately and classifying the rest as expected. You'll have to record current state to know, for example, whether a symbola continues a previous identifier or starts a new one, but that can be implemented wit a simple enum, quite doable for very simple grammars.

Strings as errors

Your parser returns errors as human-readable strings. It would make interfacing with your parser extremely challenging for other programs. Consider defining aParseError enum instead and implementingDisplay for it for human-readable output. Or just usethiserror to to the same in declarative fashion, reducing boilerplate to the minimum.

Error context

It would be nice to say "Unexpected ')' at char 12" instead of just "Unexpected ')'" - precise location can help your users find the unbalanced parenthesis much easier.

Run some checks

Please consider writing some basic tests for your parser. Rust ships with a "testing framework" out of the box, so you can just do

#[test]fn test_something() {    // ...}

and thencargo test.

Use your tools

Cargo ships with two amazing tools:rustfmt andclippy. Your code is obviously manually formatted, and this will be a problem when many developers work on a codebase. Aim for a standard style that can be easily enforced, this will reduce your merge conflicts and make it easy for everyone to follow the code.

clippy is a linter - built-in yet powerful. Running it on your Godbolt example (you can even do that online in the official playground at play.rust-lang.org), I get

warning: using `clone` on type `u64` which implements the `Copy` trait  --> src/main.rs:51:61   |51 |                 stack.last_mut().unwrap().push(Node::Number(value.clone()));   |                                                             ^^^^^^^^^^^^^ help: try dereferencing it: `*value`   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#clone_on_copy   = note: `#[warn(clippy::clone_on_copy)]` on by defaultwarning: this expression creates a reference which is immediately dereferenced by the compiler   --> src/main.rs:116:20    |116 |     match tokenize(&s1) {    |                    ^^^ help: change this to: `s1`    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#needless_borrow    = note: `#[warn(clippy::needless_borrow)]` on by default

consider addressing both warnings (using*value and removing the&, resp.). You can also make clippymuch stricter with smth like#![warn(clippy::all, clippy::pedantic)], that will report a few more style-ish issues:

warning: consider adding a `;` to the last statement for consistent formatting  --> src/main.rs:25:17   |25 |                 tokens.push(Token::Identifier(candidate_string.to_owned()))   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: add a `;` here: `tokens.push(Token::Identifier(candidate_string.to_owned()));`   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#semicolon_if_nothing_returnednote: the lint level is defined here  --> src/main.rs:1:22   | 1 | #![warn(clippy::all, clippy::pedantic)]   |                      ^^^^^^^^^^^^^^^^   = note: `#[warn(clippy::semicolon_if_nothing_returned)]` implied by `#[warn(clippy::pedantic)]`warning: variables can be used directly in the `format!` string  --> src/main.rs:29:38   |29 |                 Err(_) => return Err(format!("Invalid number: {}", candidate_string)),   |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_args   = note: `#[warn(clippy::uninlined_format_args)]` implied by `#[warn(clippy::pedantic)]`help: change this to   |29 -                 Err(_) => return Err(format!("Invalid number: {}", candidate_string)),29 +                 Err(_) => return Err(format!("Invalid number: {candidate_string}")),   |warning: variables can be used directly in the `format!` string   --> src/main.rs:123:28    |123 |             Err(reason) => println!("Error: {}", reason),    |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_argshelp: change this to    |123 -             Err(reason) => println!("Error: {}", reason),123 +             Err(reason) => println!("Error: {reason}"),    |warning: variables can be used directly in the `format!` string   --> src/main.rs:125:24    |125 |         Err(reason) => println!("Error: {}", reason),    |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_argshelp: change this to    |125 -         Err(reason) => println!("Error: {}", reason),125 +         Err(reason) => println!("Error: {reason}"),    |```

Question 3

+1 for OnceLock thanks! I'm not trying to make a good s-expression parser, just good <anything> Rust code.

STerliakov 2,1423 silver badges13 bronze badges · Answer 1 · 2025-11-03 15:40:41Z

Your questions

Is there a nice way to compile that regex once and store it somewhere?

Yep,OnceLock:

use std::sync::OnceLock;static REGEX: OnceLock<Regex> = OnceLock::new();fn tokenize (source: &str) -> Result<Vec<Token>, String> {    let mut tokens: Vec<Token> = Vec::new();    let regex = REGEX.get_or_init(        || Regex::new(r"[0-9][-+*?!_a-zA-Z0-9]*|[-+*?!_a-zA-Z][-+*?!_a-zA-Z0-9]*|\(|\)").unwrap()    );    // ...    Ok(vec![])}

Parsing approach

You chose regex as a way to find the relevant symbols. It may fit your design well, we don't know the context, but it's quite unusual for a parser to simply discard anything it does not understand.

For example, you will parsea & b as two identifiersa andb, ignoring the ampersand altogether. Again, maybe that's what you really want, but it would be confusing for me.

The more standard approach to manual (without defining a grammar) parsing would be to read characters one by one, rejecting invalid ones immediately and classifying the rest as expected. You'll have to record current state to know, for example, whether a symbola continues a previous identifier or starts a new one, but that can be implemented wit a simple enum, quite doable for very simple grammars.

Strings as errors

Your parser returns errors as human-readable strings. It would make interfacing with your parser extremely challenging for other programs. Consider defining aParseError enum instead and implementingDisplay for it for human-readable output. Or just usethiserror to to the same in declarative fashion, reducing boilerplate to the minimum.

Error context

It would be nice to say "Unexpected ')' at char 12" instead of just "Unexpected ')'" - precise location can help your users find the unbalanced parenthesis much easier.

Run some checks

Please consider writing some basic tests for your parser. Rust ships with a "testing framework" out of the box, so you can just do

#[test]fn test_something() {    // ...}

and thencargo test.

Use your tools

Cargo ships with two amazing tools:rustfmt andclippy. Your code is obviously manually formatted, and this will be a problem when many developers work on a codebase. Aim for a standard style that can be easily enforced, this will reduce your merge conflicts and make it easy for everyone to follow the code.

clippy is a linter - built-in yet powerful. Running it on your Godbolt example (you can even do that online in the official playground at play.rust-lang.org), I get

warning: using `clone` on type `u64` which implements the `Copy` trait  --> src/main.rs:51:61   |51 |                 stack.last_mut().unwrap().push(Node::Number(value.clone()));   |                                                             ^^^^^^^^^^^^^ help: try dereferencing it: `*value`   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#clone_on_copy   = note: `#[warn(clippy::clone_on_copy)]` on by defaultwarning: this expression creates a reference which is immediately dereferenced by the compiler   --> src/main.rs:116:20    |116 |     match tokenize(&s1) {    |                    ^^^ help: change this to: `s1`    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#needless_borrow    = note: `#[warn(clippy::needless_borrow)]` on by default

consider addressing both warnings (using*value and removing the&, resp.). You can also make clippymuch stricter with smth like#![warn(clippy::all, clippy::pedantic)], that will report a few more style-ish issues:

warning: consider adding a `;` to the last statement for consistent formatting  --> src/main.rs:25:17   |25 |                 tokens.push(Token::Identifier(candidate_string.to_owned()))   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: add a `;` here: `tokens.push(Token::Identifier(candidate_string.to_owned()));`   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#semicolon_if_nothing_returnednote: the lint level is defined here  --> src/main.rs:1:22   | 1 | #![warn(clippy::all, clippy::pedantic)]   |                      ^^^^^^^^^^^^^^^^   = note: `#[warn(clippy::semicolon_if_nothing_returned)]` implied by `#[warn(clippy::pedantic)]`warning: variables can be used directly in the `format!` string  --> src/main.rs:29:38   |29 |                 Err(_) => return Err(format!("Invalid number: {}", candidate_string)),   |                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   |   = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_args   = note: `#[warn(clippy::uninlined_format_args)]` implied by `#[warn(clippy::pedantic)]`help: change this to   |29 -                 Err(_) => return Err(format!("Invalid number: {}", candidate_string)),29 +                 Err(_) => return Err(format!("Invalid number: {candidate_string}")),   |warning: variables can be used directly in the `format!` string   --> src/main.rs:123:28    |123 |             Err(reason) => println!("Error: {}", reason),    |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_argshelp: change this to    |123 -             Err(reason) => println!("Error: {}", reason),123 +             Err(reason) => println!("Error: {reason}"),    |warning: variables can be used directly in the `format!` string   --> src/main.rs:125:24    |125 |         Err(reason) => println!("Error: {}", reason),    |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    |    = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.91.0/index.html#uninlined_format_argshelp: change this to    |125 -         Err(reason) => println!("Error: {}", reason),125 +         Err(reason) => println!("Error: {reason}"),    |```

+1 for OnceLock thanks! I'm not trying to make a good s-expression parser, just good <anything> Rust code.

Movatterモバイル変換

Stack Exchange Network

S-expression parser in Rust

1 Answer1

Your questions

Parsing approach

Strings as errors

Error context

Run some checks

Use your tools

You mustlog in to answer this question.

Related

Hot Network Questions

Subscribe to RSS