NotificationsYou must be signed in to change notification settings
Fork471
Star7.2k

fix JS regex literal parsing in char classes#7790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Merged

mediremi merged 4 commits intomasterfromfix-regex-char-classes

Aug 28, 2025

Merged

fix JS regex literal parsing in char classes#7790

mediremi merged 4 commits intomasterfromfix-regex-char-classes

Aug 28, 2025

Conversation

Copy link

Member

zth commentedAug 22, 2025

Found a case where a valid JS regex literal didn't parse. Led me and GPT5 down a rabbit hole, and this is the result.

All of the added tests toregex.res failed before these changes.

zth requested review fromCopilot andcristianoc

August 22, 2025 20:05

Copy link

MemberAuthor

zth commentedAug 22, 2025•
edited
Loading

@cristianoc your eyes would be good here.
@glennsl if you're around, seeing as you did the original implementation, care to take a look?

CopilotAI reviewed

Aug 22, 2025

View reviewed changes

Copy link

Contributor

CopilotAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Pull Request Overview

This PR fixes JavaScript regex literal parsing when character classes contain special characters that were previously incorrectly interpreted as regex delimiters. The fix ensures proper handling of edge cases like leading] characters and/ characters within character classes.

Implements lookahead logic to validate character class closers before entering class parsing mode
Adds proper handling of beginning-of-class semantics for literal] and^ characters
Updates regex scanner to track character class state and apply correct escape rules

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`compiler/syntax/src/res_scanner.ml`	Implements new character class parsing logic with lookahead validation and proper state tracking
`tests/syntax_tests/data/parsing/grammar/expressions/regex.res`	Adds test cases for edge cases involving character classes with special characters
`tests/syntax_tests/data/parsing/grammar/expressions/expected/regex.res.txt`	Updates expected test output to match new parsing behavior

_{Tip: Customize your code reviews with copilot-instructions.md.Create the file orlearn how to get started.}

compiler/syntax/src/res_scanner.ml Outdated

		letrecloop()=
		if!i>= lenthenfalse
		else
		matchString.unsafe_get src!iwith

Copy link

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

UsingString.unsafe_get can lead to bounds checking violations. Consider usingString.get instead, or ensure proper bounds checking is performed before this call.

Suggested change

	matchString.unsafe_get src!iwith
	matchString.get src!iwith

Copilot uses AI. Check for mistakes.

compiler/syntax/src/res_scanner.ml Outdated

		else
		matchString.unsafe_get src!iwith
		\|'\n'\|'\r' ->false
		\|'\\' ->

Copy link

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The escape handling logic incrementsi by 2 without checking ifi + 1 is within bounds after the increment. Consider adding bounds checking for the second character of the escape sequence.

Copilot uses AI. Check for mistakes.

compiler/syntax/src/res_scanner.ml Outdated

		loop()
		in
		loop()
		in

Copy link

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Starting the scan atfrom_offset + 1 assumes the character atfrom_offset is '[', but this is not validated. Iffrom_offset doesn't point to '[', the parsing logic may be incorrect.

Suggested change

	in
	(* Validate that from_offset points to '['*)
	if from_offset<0\|\| from_offset>= len\|\|String.unsafe_get src from_offset<>'['then
	false
	else
	let i=ref (from_offset+1)in
	(* start scanning after current '['*)
	let bos=reftruein
	letrecloop()=
	if!i>= lenthenfalse
	else
	matchString.unsafe_get src!iwith
	\|'\n'\|'\r' ->false
	\|'\\' ->
	if!i+1< lenthen (
	i:=!i+2;
	loop())
	elsefalse
	\|'^'when!bos ->
	incr i;
	loop()
	\|']'when!bos ->
	(* Leading ']' is literal content; after that, we're no longer at BOS.*)
	bos:=false;
	incr i;
	loop()
	\|']' ->true
	\|_ ->
	bos:=false;
	incr i;
	loop()
	in
	loop()

Copilot uses AI. Check for mistakes.

Copy link

pkg-pr-newbot commentedAug 22, 2025•
edited
Loading

Open in StackBlitz

rescript

npm i https://pkg.pr.new/rescript-lang/rescript@7790

@rescript/darwin-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-arm64@7790

@rescript/darwin-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/darwin-x64@7790

@rescript/linux-arm64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-arm64@7790

@rescript/linux-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/linux-x64@7790

@rescript/runtime

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/runtime@7790

@rescript/win32-x64

npm i https://pkg.pr.new/rescript-lang/rescript/@rescript/win32-x64@7790

commit:16412d8

cristianoc reviewed

Aug 23, 2025

View reviewed changes

Copy link

Collaborator

cristianoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Not familiar with the details of regexps.
At high level the code looks of legit shape, but don't know how to check details. If there are enough tests it should be OK.

mediremi reviewed

Aug 23, 2025

View reviewed changes

tests/syntax_tests/data/parsing/grammar/expressions/regex.res Outdated

		letre=/[]]/
		letre=/[\]]/
		letre=/[[]]/
		letre=/[^]/]/

Copy link

Member

mediremiAug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Would it be possible for the compiler to reject/[]/]/ and/[^]/]/? They're not valid regexes in JS and so would cause a runtime error

Copy link

MemberAuthor

zthAug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Ideally yeah, but I'm not sure how we could do that in a good way. Any ideas?

Copy link

Member

mediremiAug 23, 2025•
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Claude generated the following (diff is againstmaster, not this branch):

diff --git i/compiler/syntax/src/res_scanner.ml w/compiler/syntax/src/res_scanner.mlindex c404d36cc..3415fbd02 100644--- i/compiler/syntax/src/res_scanner.ml+++ w/compiler/syntax/src/res_scanner.ml@@ -580,9 +580,9 @@ let scan_regex scanner =       bring_buf_up_to_date ~start_offset:last_char_offset;       Buffer.contents buf)   in-  let rec scan () =+  let rec scan ?(in_char_class = false) () =     match scanner.ch with-    | '/' ->+    | '/' when not in_char_class ->       let last_char_offset = scanner.offset in       next scanner;       let pattern = result ~first_char_offset ~last_char_offset in@@ -606,10 +606,16 @@ let scan_regex scanner =     | '\\' ->       next scanner;       next scanner;-      scan ()+      scan ~in_char_class ()+    | '[' when not in_char_class ->+      next scanner;+      scan ~in_char_class:true ()+    | ']' when in_char_class ->+      next scanner;+      scan ~in_char_class:false ()     | _ ->       next scanner;-      scan ()+      scan ~in_char_class ()   in   let pattern, flags = scan () in   let end_pos = position scanner in

Which manages to compile these regexes:

letre=/\.[^/.]+$/letre=/[^]]/letre=/[/]/letre=/[]]/letre=/[\]]/letre=/[[]]/

And rejects the following:

letre=/[]/]/

letre=/[^]/]/

Copy link

Collaborator

cristianocAug 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What's a definition of what are valid regexes?
Eg wondering about 2 vs 3 "/" and how to know one regexp is finished.

Copy link

Member

mediremiAug 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

In a character class you can 0 to infinity (unescaped)/:

letre=/[/]/letre=/[//]/letre=/[///]/

But outside of character classes,/ has to be escaped

Copy link

MemberAuthor

zthAug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

So we need to keep track of the[], and that's it essentially?

Copy link

Member

mediremiAug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Seems like it

Copy link

MemberAuthor

zthAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@mediremi would you be up for taking a stab at fixing what's left of this? Making sure the cases you mention are rejected.

Copy link

Member

mediremiAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No problem - I'll finish off this PR later today 👍

Copy link

MemberAuthor

zthAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thanks! Feels like it needs a fresh pair of eyes.

Copy link

Member

cknitt commentedAug 25, 2025

Just ran into the same issue when trying to upgrade one of our projects to v12 and can confirm that this PR fixes it.

Would be great to have the fix in this week's beta release.

cristianoc approved these changes

Aug 25, 2025

View reviewed changes

fix JS regex literal parsing in char classes

f2d3e4c

zth force-pushed thefix-regex-char-classes branch frome0d818b tof2d3e4cCompare

August 25, 2025 16:42

mediremi added2 commits

August 28, 2025 00:19

Merge branch 'master' into fix-regex-char-classes

fe5e0f9

Simplify regex character class parsing and reject more invalid regexes

88efa86

mediremi reviewed

Aug 27, 2025

View reviewed changes

tests/syntax_tests/data/parsing/grammar/expressions/regex.res

		letre=/a[-]?c/
		letre=/(abc)\\1/
		letre=/([a-c]*)\\1/
		letre=/(?i)abc/

Copy link

Member

mediremiAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

JavaScript doesn't support inline flag syntax ((?i))

> let re = /(?i)abc/let re = /(?i)abc/         ^^^^^^^^^Uncaught SyntaxError: Invalid regular expression: /(?i)abc/: Invalid group

mediremi reviewed

Aug 27, 2025

View reviewed changes

tests/syntax_tests/data/parsing/grammar/expressions/regex.res

		// NOTE: not an error under PCRE/PRE:
		//let re = /a[b-]/
		letre=/a[]b/
		letre=/a[/

Copy link

Member

mediremiAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

The removed lines are invalid regexes in JavaScript. Some of them now fail (e.g./a[/) thanks to the now correct parsing of character classes.

mediremi reviewed

Aug 27, 2025

View reviewed changes

tests/syntax_tests/data/parsing/errors/expressions/regexCharacterClasses.res

		@@ -0,0 +1 @@
		letre=/[]/]/

Copy link

Member

mediremiAug 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Added an error test for an invalid use of character classes

Add CHANGELOG entry

16412d8

mediremi marked this pull request as ready for review

August 27, 2025 22:48

zth commented

Aug 28, 2025

View reviewed changes

Copy link

MemberAuthor

zth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Awesome!@mediremi ready to merge?

mediremi merged commitea569cd intomaster

Aug 28, 2025

25 checks passed

mediremi deleted the fix-regex-char-classes branch

August 28, 2025 08:32

Labels

None yet

5 participants

Movatterモバイル変換

fix JS regex literal parsing in char classes#7790

fix JS regex literal parsing in char classes#7790

Uh oh!

Conversation

zth commentedAug 22, 2025

Uh oh!

zth commentedAug 22, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

CopilotAI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

Uh oh!

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

Uh oh!

CopilotAIAug 22, 2025

Choose a reason for hiding this comment

Uh oh!

pkg-pr-newbot commentedAug 22, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

cristianoc left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mediremiAug 23, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cknitt commentedAug 25, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zth commentedAug 22, 2025•
edited
Loading

pkg-pr-newbot commentedAug 22, 2025•
edited
Loading

mediremiAug 23, 2025•
edited
Loading