I'm not convinced this is the best way to achieve this, and likely would benefit from some tests where it would not be a utf8 char boundary, but it passes the Oniguruma test cases.

Copy link

Contributor

robinstApr 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Can you check how Oniguruma handles Unicode? E.g. for a regex like(δ)(?i:\\1), does it match the textδΔ (I suspect yes)?
I think we should delegate to the regex crate's implementation of case-insensitive matching, to be consistent with other cases. The question is, how do we do that most efficiently given the literal we're matching against comes from a match during vm execution. I had a look through regex-automata and regex-syntax, and it looks like case folding happens in regex-syntax (I thinkhere). I wonder what the cheapest way of creating a regex that matches a literal is (i.e. if we could bypass parsing, that would be cool; instead of having to escape the literal and then go through full regex parsing before matching).

Copy link

ContributorAuthor

keith-hallApr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Regarding number 1, yep, you are right, Oniguruma finds a match

With number 2, I believe we can constructHir directly, to avoid having to escape the literal and go through the parser again 👍

Copy link

Contributor

robinstMay 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

With number 2, I believe we can constructHir directly, to avoid having to escape the literal and go through the parser again 👍

Sounds good!

Copy link

ContributorAuthor

keith-hallMay 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done :)

fix test

3d4bde3

keith-hall force-pushed therelative_recursion_level_backref branch from7fc83d3 to453c1b9Compare

April 29, 2025 20:16

keith-hall added4 commits

May 10, 2025 22:47

use hir to support non-ascii case insensitive backrefs

4821f76

- keep the simple approach for when the captured group contents is ascii- use hir to utilize regex-automata's case insensitive literal matching for when the captured group contents is not ascii

Merge remote-tracking branch 'fancy-regex/main' into casei_backrefs

f7ae14c

Merge branch 'relative_recursion_level_backref' into casei_backrefs

905ed68

add tests to prove that literals which could have special meaning in …

ab8db70

…regex are treated verbatim

robinst approved these changes

May 11, 2025

View reviewed changes

Copy link

Contributor

robinst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Some suggestions :)

src/vm.rs Outdated

		@@ -419,6 +424,60 @@ fn matches_literal(s: &str, ix: usize, end: usize, literal: &str) -> bool {
		end <= s.len() && &s.as_bytes()[ix..end] == literal.as_bytes()
		}

		#[inline]

Copy link

Contributor

robinstMay 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'd probably not force inlining of this as it's pretty long (and probably not used often). AFAIK leaving it off will mean that the compiler will decide.

Copy link

ContributorAuthor

keith-hallMay 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

good point, thanks, fixed :)

src/vm.rs Outdated


		let chars = literal.chars();

		use alloc::boxed::Box;

Copy link

Contributor

robinstMay 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is thisuse necessary?

Copy link

ContributorAuthor

keith-hallMay 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

it was (for--no-default-features) but since I applied your suggestion it become unnecessary 🎉

Copy link

Contributor

robinstMay 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh I see :)

src/vm.rs Outdated

Comment on lines 446 to 472

		let span = Span::splat(Position::new(0, 0, 0));
		let mut literals = Vec::with_capacity(literal.len());
		for c in chars {
		let ast_literal = Ast::Literal(Box::new(Literal {
		span,
		kind: LiteralKind::Verbatim,
		c: c,
		}));
		literals.push(ast_literal);
		}
		let ast_flags = Ast::Group(Box::new(Group {
		span,
		kind: GroupKind::NonCapturing(Flags {
		span,
		items: vec![FlagsItem {
		span,
		kind: FlagsItemKind::Flag(Flag::CaseInsensitive),
		}],
		}),
		ast: Box::new(Ast::Concat(Box::new(Concat {
		span,
		asts: literals,
		}))),
		}));

		let mut translator = regex_syntax::hir::translate::Translator::new();
		let hir = translator.translate(literal, &ast_flags).unwrap();

Copy link

Contributor

robinstMay 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks like we can simplify this a bit, like this:

let span =Span::splat(Position::new(0,0,0));let literals = literal.chars().map(|c|{Ast::literal(Literal{                span,kind:LiteralKind::Verbatim,c: c,})}).collect();let ast =Ast::concat(Concat{        span,asts: literals,});letmut translator = regex_syntax::hir::translate::TranslatorBuilder::new().case_insensitive(true).build();let hir = translator.translate(literal,&ast).unwrap();