Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork41
Support case insensitive backrefs#160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
src/vm.rs Outdated
|| (s.is_char_boundary(ix) | ||
&& s.is_char_boundary(end) | ||
&& s[ix..end].eq_ignore_ascii_case(literal))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'm not convinced this is the best way to achieve this, and likely would benefit from some tests where it would not be a utf8 char boundary, but it passes the Oniguruma test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
- Can you check how Oniguruma handles Unicode? E.g. for a regex like
(δ)(?i:\\1)
, does it match the textδΔ
(I suspect yes)? - I think we should delegate to the regex crate's implementation of case-insensitive matching, to be consistent with other cases. The question is, how do we do that most efficiently given the literal we're matching against comes from a match during vm execution. I had a look through regex-automata and regex-syntax, and it looks like case folding happens in regex-syntax (I thinkhere). I wonder what the cheapest way of creating a regex that matches a literal is (i.e. if we could bypass parsing, that would be cool; instead of having to escape the literal and then go through full regex parsing before matching).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Regarding number 1, yep, you are right, Oniguruma finds a match
With number 2, I believe we can constructHir
directly, to avoid having to escape the literal and go through the parser again 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
With number 2, I believe we can construct
Hir
directly, to avoid having to escape the literal and go through the parser again 👍
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
done :)
7fc83d3
to453c1b9
Compare- keep the simple approach for when the captured group contents is ascii- use hir to utilize regex-automata's case insensitive literal matching for when the captured group contents is not ascii
…regex are treated verbatim
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Some suggestions :)
src/vm.rs Outdated
@@ -419,6 +424,60 @@ fn matches_literal(s: &str, ix: usize, end: usize, literal: &str) -> bool { | |||
end <= s.len() && &s.as_bytes()[ix..end] == literal.as_bytes() | |||
} | |||
#[inline] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I'd probably not force inlining of this as it's pretty long (and probably not used often). AFAIK leaving it off will mean that the compiler will decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
good point, thanks, fixed :)
src/vm.rs Outdated
let chars = literal.chars(); | ||
use alloc::boxed::Box; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Is thisuse
necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
it was (for--no-default-features
) but since I applied your suggestion it become unnecessary 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Oh I see :)
src/vm.rs Outdated
let span = Span::splat(Position::new(0, 0, 0)); | ||
let mut literals = Vec::with_capacity(literal.len()); | ||
for c in chars { | ||
let ast_literal = Ast::Literal(Box::new(Literal { | ||
span, | ||
kind: LiteralKind::Verbatim, | ||
c: c, | ||
})); | ||
literals.push(ast_literal); | ||
} | ||
let ast_flags = Ast::Group(Box::new(Group { | ||
span, | ||
kind: GroupKind::NonCapturing(Flags { | ||
span, | ||
items: vec![FlagsItem { | ||
span, | ||
kind: FlagsItemKind::Flag(Flag::CaseInsensitive), | ||
}], | ||
}), | ||
ast: Box::new(Ast::Concat(Box::new(Concat { | ||
span, | ||
asts: literals, | ||
}))), | ||
})); | ||
let mut translator = regex_syntax::hir::translate::Translator::new(); | ||
let hir = translator.translate(literal, &ast_flags).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looks like we can simplify this a bit, like this:
let span =Span::splat(Position::new(0,0,0));let literals = literal.chars().map(|c|{Ast::literal(Literal{ span,kind:LiteralKind::Verbatim,c: c,})}).collect();let ast =Ast::concat(Concat{ span,asts: literals,});letmut translator = regex_syntax::hir::translate::TranslatorBuilder::new().case_insensitive(true).build();let hir = translator.translate(literal,&ast).unwrap();
(Note that usingmap
andcollect
means we don't need to give our own size hint to the vector,Chars
does it for us.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
thanks, I like! 🚀 done! 🙂
30567d8
intofancy-regex:mainUh oh!
There was an error while loading.Please reload this page.
No description provided.