Rust library for regular expressions using "fancy" features like look-around and backreferences
A Rust library for compiling and matching regular expressions. It uses a hybrid regex implementation designed to support a relatively rich set of features. In particular, it uses backtracking to implement "fancy" features such as look-around and backreferences, which are not supported in purely NFA-based implementations (exemplified by RE2, and implemented in Rust in the regex crate).
A goal is to be as efficient as possible. For a given regex, the NFA implementation has asymptotic running time linear in the length of the input, while in the general case a backtracking implementation has exponential blowup. An example given in Static Analysis for Regular Expression Exponential Runtime via Substructural Logics is:
```python
import re
re.compile('(a|b|ab)*bc').match('ab' * 28 + 'ac')
```
In Python (tested on both 2.7 and 3.5), this match takes 91s, and doubles for each additional repeat of 'ab'.
Thus, many proponents advocate a purely NFA (nondeterministic finite automaton) based approach. Even so, backreferences and look-around do add richness to regexes, and they are commonly used in applications such as syntax highlighting for text editors. In particular, TextMate's syntax definitions, based on the Oniguruma backtracking engine, are now used in a number of other popular editors, including Sublime Text and Atom. These syntax definitions routinely use backreferences and look-around. For example, the following regex captures a single-line Rust raw string:
```
r(#*)".*?"\1
```
There is no NFA that can express this simple and useful pattern. Yet, a backtracking implementation handles it efficiently.
This package is one of the first that handles both cases well. The exponential blowup case above is run in 258ns. Thus, it should be a very appealing alternative for applications that require both richness and performance.
NFA-based approaches give strong guarantees about worst-case performance. For regexes that contain "fancy" features such as backreferences and look-around, this module gives no corresponding guarantee. If an attacker can control the regular expressions that will be matched against, they will be able to successfully mount a denial-of-service attack. Be warned.
See PERFORMANCE.md for some examples.
One workable approach is to detect the presence of "fancy" features, and choose either an NFA implementation or a backtracker depending on whether they are used.
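A toy version of such a coarse-grained check might look like the following. This is a hypothetical sketch, not this crate's actual analysis: it scans the pattern text rather than a parsed AST, so it only approximates real feature detection.

```rust
/// Hypothetical coarse check for "fancy" features by scanning the
/// pattern source. A real implementation would work on the parsed AST.
fn looks_fancy(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                // \1 .. \9 are backreferences.
                if i + 1 < bytes.len() && bytes[i + 1].is_ascii_digit() {
                    return true;
                }
                i += 2; // skip the escaped character
                continue;
            }
            b'(' => {
                // (?= (?! (?<= (?<! introduce look-around assertions.
                if pattern[i..].starts_with("(?=")
                    || pattern[i..].starts_with("(?!")
                    || pattern[i..].starts_with("(?<=")
                    || pattern[i..].starts_with("(?<!")
                {
                    return true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    false
}

fn main() {
    assert!(!looks_fancy("(a|b|ab)*bc"));      // plain, NFA-friendly
    assert!(looks_fancy(r#"r(#*)".*?"\1"#));   // backreference
    assert!(looks_fancy("foo(?=bar)"));        // look-ahead
    println!("all checks passed");
}
```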
However, this module attempts to be more fine-grained. Instead, it implements a true hybrid approach. In essence, it is a backtracking VM (as well explained in Regular Expression Matching: the Virtual Machine Approach) in which one of the "instructions" in the VM delegates to an inner NFA implementation (in Rust, the regex crate, though a similar approach would certainly be possible using RE2 or the Go regexp package). Then there's an analysis which decides for each subexpression whether it is "hard", or can be delegated to the NFA matcher. At the moment, it is eager, and delegates as much as possible to the NFA engine.
(This section is written in a somewhat informal style; I hope to expand on it.)
The fundamental idea is that it's a backtracking VM like PCRE, but as much as possible it delegates to an "inner" RE engine like RE2 (in this case, the Rust one). For the sublanguage not using fancy features, the library becomes a thin wrapper.
Otherwise, you do an analysis to figure out what you can delegate and what you have to backtrack. I was thinking it might be tricky, but it's actually quite simple. In the first phase, you just label each subexpression as "hard" (groups that get referenced in a backref, look-around, etc.), and bubble that up. You also do a little extra analysis, mostly determining whether an expression has constant match length, and the minimum length.
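The first phase might be sketched like this (hypothetical `Expr` and `Info` types, not this crate's actual AST; minimum-length tracking is omitted for brevity):

```rust
/// Simplified expression tree (hypothetical; the real AST is richer).
enum Expr {
    Literal(String),
    Backref(usize),
    LookAhead(Box<Expr>),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
}

/// Result of the bottom-up analysis for one subexpression.
struct Info {
    hard: bool,               // uses a fancy feature somewhere inside
    const_len: Option<usize>, // Some(n) if every match has length n
}

fn analyze(e: &Expr) -> Info {
    match e {
        Expr::Literal(s) => Info { hard: false, const_len: Some(s.len()) },
        // Backreferences are "hard": they force backtracking.
        Expr::Backref(_) => Info { hard: true, const_len: None },
        Expr::LookAhead(inner) => {
            analyze(inner); // children still get analyzed
            // Look-around is hard, but consumes no input itself.
            Info { hard: true, const_len: Some(0) }
        }
        Expr::Concat(children) => {
            let infos: Vec<Info> = children.iter().map(analyze).collect();
            Info {
                // Hardness bubbles up from any child.
                hard: infos.iter().any(|i| i.hard),
                // Lengths add up if every child has a constant length.
                const_len: infos.iter()
                    .map(|i| i.const_len)
                    .sum::<Option<usize>>(),
            }
        }
        Expr::Alt(children) => {
            let infos: Vec<Info> = children.iter().map(analyze).collect();
            let first = infos[0].const_len;
            Info {
                hard: infos.iter().any(|i| i.hard),
                // Constant only if all branches share the same length.
                const_len: if infos.iter().all(|i| i.const_len == first) {
                    first
                } else {
                    None
                },
            }
        }
    }
}

fn main() {
    let easy = Expr::Concat(vec![
        Expr::Literal("ab".into()),
        Expr::Alt(vec![Expr::Literal("c".into()), Expr::Literal("d".into())]),
    ]);
    let info = analyze(&easy);
    assert!(!info.hard);
    assert_eq!(info.const_len, Some(3));

    let hard = Expr::Concat(vec![Expr::Literal("r".into()), Expr::Backref(1)]);
    assert!(analyze(&hard).hard);
    println!("analysis ok");
}
```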
The second phase is top-down, and you carry a context: a boolean indicating whether it's "hard" or not. Intuitively, a hard context is one in which the match length will affect future backtracking.
If the subexpression is easy and the context is easy, generate an instruction in the VM that delegates to the inner NFA implementation. Otherwise, generate VM code as in a backtracking engine. Most expression nodes are pretty straightforward; the only interesting case is concat (a sequence of subexpressions).
Even that one is not terribly complex. First, determine a prefix of easy nodes of constant match length (this won't affect backtracking, so it's safe to delegate to the NFA). Then, if your context is easy, determine a suffix of easy nodes. Both of these delegate to the NFA. For the ones in between, compile recursively. In an easy context, the last of these also gets an easy context; everything else is generated in a hard context. So, conceptually, hard context flows from right to left, and from parents to children.
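That splitting rule can be sketched as follows (hypothetical `Child` and `Plan` types; the real compiler works on the analyzed AST rather than precomputed flags):

```rust
/// Per-child summary (hypothetical): is the child easy to delegate,
/// and does it have constant match length?
#[derive(Clone, Copy)]
struct Child {
    easy: bool,
    const_len: bool,
}

/// What the compiler would emit for each child of a concat node.
#[derive(Debug, PartialEq)]
enum Plan {
    Delegate,      // a single "call the inner NFA engine" instruction
    Compile(bool), // recursive compilation; bool = easy context for the child
}

/// Split a concat node's children as described above: a delegated
/// prefix of easy constant-length children, a delegated suffix of easy
/// children when the context is easy, and recursive compilation in
/// between, with hard context flowing from right to left.
fn plan_concat(children: &[Child], easy_context: bool) -> Vec<Plan> {
    let prefix = children.iter()
        .take_while(|c| c.easy && c.const_len)
        .count();
    let suffix = if easy_context {
        children[prefix..].iter()
            .rev()
            .take_while(|c| c.easy)
            .count()
    } else {
        0
    };
    let mid_end = children.len() - suffix;
    (0..children.len()).map(|i| {
        if i < prefix || i >= mid_end {
            Plan::Delegate
        } else {
            // Only the last middle child inherits an easy context.
            Plan::Compile(easy_context && i == mid_end - 1)
        }
    }).collect()
}

fn main() {
    let konst = Child { easy: true, const_len: true };
    let var = Child { easy: true, const_len: false };
    let hard = Child { easy: false, const_len: false };

    // [constant, hard, variable-length] in an easy context: the
    // constant prefix and the easy suffix delegate to the NFA; the
    // hard child in between is compiled, and as the last middle
    // child in an easy context it gets an easy context itself.
    let plan = plan_concat(&[konst, hard, var], true);
    assert_eq!(
        plan,
        vec![Plan::Delegate, Plan::Compile(true), Plan::Delegate]
    );
    println!("plan ok");
}
```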
Still in development, though the basic ideas are in place. Currently, the following features are missing:
- Procedure calls and recursive expressions
Many thanks to Andrew Gallant for stimulating conversations that inspired this approach, as well as for creating the excellent regex crate.
The main author is Raph Levien, with many contributions from Robin Stocker.
We gladly accept contributions via GitHub pull requests. Please see CONTRIBUTING.md for more details.
This project started out as a Google 20% project, but none of the authors currently work at Google, so it has been forked to be community-maintained.