Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Support case insensitive backrefs#160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
keith-hall merged 11 commits intofancy-regex:mainfromforkeith:casei_backrefs
May 11, 2025

Conversation

keith-hall
Copy link
Contributor

No description provided.

src/vm.rs Outdated
Comment on lines 431 to 433
|| (s.is_char_boundary(ix)
&& s.is_char_boundary(end)
&& s[ix..end].eq_ignore_ascii_case(literal)))
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not convinced this is the best way to achieve this, and likely would benefit from some tests where it would not be a utf8 char boundary, but it passes the Oniguruma test cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

  1. Can you check how Oniguruma handles Unicode? E.g. for a regex like(δ)(?i:\\1), does it match the textδΔ (I suspect yes)?
  2. I think we should delegate to the regex crate's implementation of case-insensitive matching, to be consistent with other cases. The question is, how do we do that most efficiently given the literal we're matching against comes from a match during vm execution. I had a look through regex-automata and regex-syntax, and it looks like case folding happens in regex-syntax (I thinkhere). I wonder what the cheapest way of creating a regex that matches a literal is (i.e. if we could bypass parsing, that would be cool; instead of having to escape the literal and then go through full regex parsing before matching).

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Regarding number 1, yep, you are right, Oniguruma finds a match

With number 2, I believe we can constructHir directly, to avoid having to escape the literal and go through the parser again 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

With number 2, I believe we can constructHir directly, to avoid having to escape the literal and go through the parser again 👍

Sounds good!

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

done :)

@keith-hallkeith-hallforce-pushed therelative_recursion_level_backref branch from7fc83d3 to453c1b9CompareApril 29, 2025 20:16
- keep the simple approach for when the captured group contents is ascii- use hir to utilize regex-automata's case insensitive literal matching for when the captured group contents is not ascii
Copy link
Contributor

@robinstrobinst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Some suggestions :)

src/vm.rs Outdated
@@ -419,6 +424,60 @@ fn matches_literal(s: &str, ix: usize, end: usize, literal: &str) -> bool {
end <= s.len() && &s.as_bytes()[ix..end] == literal.as_bytes()
}

#[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'd probably not force inlining of this as it's pretty long (and probably not used often). AFAIK leaving it off will mean that the compiler will decide.

keith-hall reacted with thumbs up emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

good point, thanks, fixed :)

src/vm.rs Outdated

let chars = literal.chars();

use alloc::boxed::Box;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Is thisuse necessary?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

it was (for--no-default-features) but since I applied your suggestion it become unnecessary 🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Oh I see :)

src/vm.rs Outdated
Comment on lines 446 to 472
let span = Span::splat(Position::new(0, 0, 0));
let mut literals = Vec::with_capacity(literal.len());
for c in chars {
let ast_literal = Ast::Literal(Box::new(Literal {
span,
kind: LiteralKind::Verbatim,
c: c,
}));
literals.push(ast_literal);
}
let ast_flags = Ast::Group(Box::new(Group {
span,
kind: GroupKind::NonCapturing(Flags {
span,
items: vec![FlagsItem {
span,
kind: FlagsItemKind::Flag(Flag::CaseInsensitive),
}],
}),
ast: Box::new(Ast::Concat(Box::new(Concat {
span,
asts: literals,
}))),
}));

let mut translator = regex_syntax::hir::translate::Translator::new();
let hir = translator.translate(literal, &ast_flags).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Looks like we can simplify this a bit, like this:

let span =Span::splat(Position::new(0,0,0));let literals = literal.chars().map(|c|{Ast::literal(Literal{                span,kind:LiteralKind::Verbatim,c: c,})}).collect();let ast =Ast::concat(Concat{        span,asts: literals,});letmut translator = regex_syntax::hir::translate::TranslatorBuilder::new().case_insensitive(true).build();let hir = translator.translate(literal,&ast).unwrap();

(Note that usingmap andcollect means we don't need to give our own size hint to the vector,Chars does it for us.)

keith-hall reacted with heart emoji
Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

thanks, I like! 🚀 done! 🙂

@keith-hallkeith-hall deleted the branchfancy-regex:mainMay 11, 2025 17:19
@keith-hallkeith-hall changed the base branch fromrelative_recursion_level_backref tomainMay 11, 2025 17:22
@keith-hallkeith-hall merged commit30567d8 intofancy-regex:mainMay 11, 2025
14 of 18 checks passed
@keith-hallkeith-hall deleted the casei_backrefs branchMay 11, 2025 17:25
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@robinstrobinstrobinst approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

2 participants
@keith-hall@robinst

[8]ページ先頭

©2009-2025 Movatter.jp