Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

fix JS regex literal parsing in char classes#7790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
mediremi merged 4 commits intomasterfromfix-regex-char-classes
Aug 28, 2025

Conversation

@zth
Copy link
Member

@zthzth commentedAug 22, 2025

Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.

All of the added tests toregex.res failed before these changes.

@zth
Copy link
MemberAuthor

zth commentedAug 22, 2025
edited
Loading

@cristianoc your eyes would be good here.
@glennsl if you're around, seeing as you did the original implementation, care to take a look?

Copy link
Contributor

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading] characters and/ characters within character classes.

  • Implements lookahead logic to validate character class closers before entering class parsing mode
  • Adds proper handling of beginning-of-class semantics for literal] and^ characters
  • Updates regex scanner to track character class state and apply correct escape rules

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

FileDescription
compiler/syntax/src/res_scanner.mlImplements new character class parsing logic with lookahead validation and proper state tracking
tests/syntax_tests/data/parsing/grammar/expressions/regex.resAdds test cases for edge cases involving character classes with special characters
tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txtUpdates expected test output to match new parsing behavior

Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.

letrecloop()=
if!i>= lenthenfalse
else
matchString.unsafe_get src!iwith

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UsingString.unsafe_get can lead to bounds checking violations. Consider usingString.get instead, or ensure proper bounds checking is performed before this call.

Suggested change
matchString.unsafe_get src!iwith
matchString.get src!iwith

Copilot uses AI. Check for mistakes.
else
matchString.unsafe_get src!iwith
|'\n'|'\r' ->false
|'\\' ->

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The escape handling logic incrementsi by 2 without checking ifi + 1 is within bounds after the increment. Consider adding bounds checking for the second character of the escape sequence.

Copilot uses AI. Check for mistakes.
loop()
in
loop()
in

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Starting the scan atfrom_offset + 1 assumes the character atfrom_offset is '[', but this is not validated. Iffrom_offset doesn't point to '[', the parsing logic may be incorrect.

Suggested change
in
(* Validate that from_offset points to '['*)
if from_offset<0|| from_offset>= len||String.unsafe_get src from_offset<>'['then
false
else
let i=ref (from_offset+1)in
(* start scanning after current '['*)
let bos=reftruein
letrecloop()=
if!i>= lenthenfalse
else
matchString.unsafe_get src!iwith
|'\n'|'\r' ->false
|'\\' ->
if!i+1< lenthen (
i:=!i+2;
loop())
elsefalse
|'^'when!bos ->
incr i;
loop()
|']'when!bos ->
(* Leading ']' is literal content; after that, we're no longer at BOS.*)
bos:=false;
incr i;
loop()
|']' ->true
|_ ->
bos:=false;
incr i;
loop()
in
loop()

Copilot uses AI. Check for mistakes.
@pkg-pr-new
Copy link

pkg-pr-newbot commentedAug 22, 2025
edited
Loading

Open in StackBlitz

rescript

npm i https://pkg.pr.new/rescript-lang/rescript@7790

@rescript/darwin-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-arm64@7790

@rescript/darwin-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-x64@7790

@rescript/linux-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-arm64@7790

@rescript/linux-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-x64@7790

@rescript/runtime

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/runtime@7790

@rescript/win32-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/win32-x64@7790

commit:16412d8

Copy link
Collaborator

@cristianoccristianoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Not familiar with the details of regexps.
At high level the code looks of legit shape, but don't know how to check details. If there are enough tests it should be OK.

letre=/[]]/
letre=/[\]]/
letre=/[[]]/
letre=/[^]/]/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be possible for the compiler to reject/[]/]/ and/[^]/]/? They're not valid regexes in JS and so would cause a runtime error

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?

Copy link
Member

@mediremimediremiAug 23, 2025
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Claude generated the following (diff is againstmaster, not this branch):

diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.mlindex c404d36cc..3415fbd02 100644--- i/compiler/syntax/src/res_scanner.ml+++ w/compiler/syntax/src/res_scanner.ml@@ -580,9 +580,9 @@ let scan_regex scanner =       bring_buf_up_to_date ~start_offset:last_char_offset;       Buffer.contents buf)   in-  let rec scan () =+  let rec scan ?(in_char_class = false) () =     match scanner.ch with-    | '/' ->+    | '/' when not in_char_class ->       let last_char_offset = scanner.offset in       next scanner;       let pattern = result ~first_char_offset ~last_char_offset in@@ -606,10 +606,16 @@ let scan_regex scanner =     | '\\' ->       next scanner;       next scanner;-      scan ()+      scan ~in_char_class ()+    | '[' when not in_char_class ->+      next scanner;+      scan ~in_char_class:true ()+    | ']' when in_char_class ->+      next scanner;+      scan ~in_char_class:false ()     | _ ->       next scanner;-      scan ()+      scan ~in_char_class ()   in   let pattern, flags = scan () in   let end_pos = position scanner in

Which manages to compile these regexes:

letre=/\.[^/.]+$/letre=/[^]]/letre=/[/]/letre=/[]]/letre=/[\]]/letre=/[[]]/

And rejects the following:

letre=/[]/]/
letre=/[^]/]/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What's a definition of what are valid regexes?
Eg wondering about 2 vs 3 "/" and how to know one regexp is finished.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In a character class you can 0 to infinity (unescaped)/:

letre=/[/]/letre=/[//]/letre=/[///]/

But outside of character classes,/ has to be escaped

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So we need to keep track of the[], and that's it essentially?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Seems like it

Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@mediremi would you be up for taking a stab at fixing what's left of this? Making sure the cases you mention are rejected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No problem - I'll finish off this PR later today 👍

zth reacted with rocket emoji
Copy link
MemberAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks! Feels like it needs a fresh pair of eyes.

@cknitt
Copy link
Member

Just ran into the same issue when trying to upgrade one of our projects to v12 and can confirm that this PR fixes it.

Would be great to have the fix in this week's beta release.

zth reacted with thumbs up emoji

@zthzthforce-pushed thefix-regex-char-classes branch frome0d818b tof2d3e4cCompareAugust 25, 2025 16:42
letre=/a[-]?c/
letre=/(abc)\\1/
letre=/([a-c]*)\\1/
letre=/(?i)abc/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

JavaScript doesn't support inline flag syntax ((?i))

> let re = /(?i)abc/let re = /(?i)abc/         ^^^^^^^^^Uncaught SyntaxError: Invalid regular expression: /(?i)abc/: Invalid group

// NOTE: not an error under PCRE/PRE:
//let re = /a[b-]/
letre=/a[]b/
letre=/a[/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The removed lines are invalid regexes in JavaScript. Some of them now fail (e.g./a[/) thanks to the now correct parsing of character classes.

@@ -0,0 +1 @@
letre=/[]/]/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Added an error test for an invalid use of character classes

@mediremimediremi marked this pull request as ready for reviewAugust 27, 2025 22:48
Copy link
MemberAuthor

@zthzth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Awesome!@mediremi ready to merge?

mediremi reacted with thumbs up emoji
@mediremimediremi merged commitea569cd intomasterAug 28, 2025
25 checks passed
@mediremimediremi deleted the fix-regex-char-classes branchAugust 28, 2025 08:32
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@mediremimediremimediremi left review comments

Copilot code reviewCopilotCopilot left review comments

@cristianoccristianoccristianoc approved these changes

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

5 participants

@zth@cknitt@mediremi@cristianoc

[8]ページ先頭

©2009-2025 Movatter.jp