- Notifications
You must be signed in to change notification settings - Fork471
fix JS regex literal parsing in char classes#7790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
zth commentedAug 22, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
@cristianoc your eyes would be good here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Pull Request Overview
This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading] characters and/ characters within character classes.
- Implements lookahead logic to validate character class closers before entering class parsing mode
- Adds proper handling of beginning-of-class semantics for literal
]and^characters - Updates regex scanner to track character class state and apply correct escape rules
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
compiler/syntax/src/res_scanner.ml | Implements new character class parsing logic with lookahead validation and proper state tracking |
tests/syntax_tests/data/parsing/grammar/expressions/regex.res | Adds test cases for edge cases involving character classes with special characters |
tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txt | Updates expected test output to match new parsing behavior |
Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.
compiler/syntax/src/res_scanner.ml Outdated
| letrecloop()= | ||
| if!i>= lenthenfalse | ||
| else | ||
| matchString.unsafe_get src!iwith |
CopilotAIAug 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
UsingString.unsafe_get can lead to bounds checking violations. Consider usingString.get instead, or ensure proper bounds checking is performed before this call.
| matchString.unsafe_get src!iwith | |
| matchString.get src!iwith |
compiler/syntax/src/res_scanner.ml Outdated
| else | ||
| matchString.unsafe_get src!iwith | ||
| |'\n'|'\r' ->false | ||
| |'\\' -> |
CopilotAIAug 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The escape handling logic incrementsi by 2 without checking ifi + 1 is within bounds after the increment. Consider adding bounds checking for the second character of the escape sequence.
compiler/syntax/src/res_scanner.ml Outdated
| loop() | ||
| in | ||
| loop() | ||
| in |
CopilotAIAug 22, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Starting the scan atfrom_offset + 1 assumes the character atfrom_offset is '[', but this is not validated. Iffrom_offset doesn't point to '[', the parsing logic may be incorrect.
| in | |
| (* Validate that from_offset points to '['*) | |
| if from_offset<0|| from_offset>= len||String.unsafe_get src from_offset<>'['then | |
| false | |
| else | |
| let i=ref (from_offset+1)in | |
| (* start scanning after current '['*) | |
| let bos=reftruein | |
| letrecloop()= | |
| if!i>= lenthenfalse | |
| else | |
| matchString.unsafe_get src!iwith | |
| |'\n'|'\r' ->false | |
| |'\\' -> | |
| if!i+1< lenthen ( | |
| i:=!i+2; | |
| loop()) | |
| elsefalse | |
| |'^'when!bos -> | |
| incr i; | |
| loop() | |
| |']'when!bos -> | |
| (* Leading ']' is literal content; after that, we're no longer at BOS.*) | |
| bos:=false; | |
| incr i; | |
| loop() | |
| |']' ->true | |
| |_ -> | |
| bos:=false; | |
| incr i; | |
| loop() | |
| in | |
| loop() |
pkg-pr-newbot commentedAug 22, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
rescript@rescript/darwin-arm64@rescript/darwin-x64@rescript/linux-arm64@rescript/linux-x64@rescript/runtime@rescript/win32-x64commit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Not familiar with the details of regexps.
At high level the code looks of legit shape, but don't know how to check details. If there are enough tests it should be OK.
| letre=/[]]/ | ||
| letre=/[\]]/ | ||
| letre=/[[]]/ | ||
| letre=/[^]/]/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Would it be possible for the compiler to reject/[]/]/ and/[^]/]/? They're not valid regexes in JS and so would cause a runtime error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Claude generated the following (diff is againstmaster, not this branch):
diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.mlindex c404d36cc..3415fbd02 100644--- i/compiler/syntax/src/res_scanner.ml+++ w/compiler/syntax/src/res_scanner.ml@@ -580,9 +580,9 @@ let scan_regex scanner = bring_buf_up_to_date ~start_offset:last_char_offset; Buffer.contents buf) in- let rec scan () =+ let rec scan ?(in_char_class = false) () = match scanner.ch with- | '/' ->+ | '/' when not in_char_class -> let last_char_offset = scanner.offset in next scanner; let pattern = result ~first_char_offset ~last_char_offset in@@ -606,10 +606,16 @@ let scan_regex scanner = | '\\' -> next scanner; next scanner;- scan ()+ scan ~in_char_class ()+ | '[' when not in_char_class ->+ next scanner;+ scan ~in_char_class:true ()+ | ']' when in_char_class ->+ next scanner;+ scan ~in_char_class:false () | _ -> next scanner;- scan ()+ scan ~in_char_class () in let pattern, flags = scan () in let end_pos = position scanner in
Which manages to compile these regexes:
letre=/\.[^/.]+$/letre=/[^]]/letre=/[/]/letre=/[]]/letre=/[\]]/letre=/[[]]/
And rejects the following:
letre=/[]/]/
letre=/[^]/]/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
What's a definition of what are valid regexes?
Eg wondering about 2 vs 3 "/" and how to know one regexp is finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
In a character class you can 0 to infinity (unescaped)/:
letre=/[/]/letre=/[//]/letre=/[///]/
But outside of character classes,/ has to be escaped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
So we need to keep track of the[], and that's it essentially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Seems like it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@mediremi would you be up for taking a stab at fixing what's left of this? Making sure the cases you mention are rejected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
No problem - I'll finish off this PR later today 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Thanks! Feels like it needs a fresh pair of eyes.
Just ran into the same issue when trying to upgrade one of our projects to v12 and can confirm that this PR fixes it. Would be great to have the fix in this week's beta release. |
| letre=/a[-]?c/ | ||
| letre=/(abc)\\1/ | ||
| letre=/([a-c]*)\\1/ | ||
| letre=/(?i)abc/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
JavaScript doesn't support inline flag syntax ((?i))
> let re = /(?i)abc/let re = /(?i)abc/ ^^^^^^^^^Uncaught SyntaxError: Invalid regular expression: /(?i)abc/: Invalid group| // NOTE: not an error under PCRE/PRE: | ||
| //let re = /a[b-]/ | ||
| letre=/a[]b/ | ||
| letre=/a[/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
The removed lines are invalid regexes in JavaScript. Some of them now fail (e.g./a[/) thanks to the now correct parsing of character classes.
| @@ -0,0 +1 @@ | |||
| letre=/[]/]/ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Added an error test for an invalid use of character classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Awesome!@mediremi ready to merge?
ea569cd intomasterUh oh!
There was an error while loading.Please reload this page.
Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.
All of the added tests to
regex.resfailed before these changes.