Rust library for regular expressions using "fancy" features like look-around and backreferences
A Rust library for compiling and matching regular expressions. It uses a hybrid regex implementation designed to support a relatively rich set of features. In particular, it uses backtracking to implement "fancy" features such as look-around and backreferences, which are not supported in purely NFA-based implementations (exemplified by RE2, and implemented in Rust in the regex crate).
A goal is to be as efficient as possible. For a given regex, the NFA implementation has asymptotic running time linear in the length of the input, while in the general case a backtracking implementation has exponential blowup. An example given in Static Analysis for Regular Expression Exponential Runtime via Substructural Logics is:
```python
import re
re.compile('(a|b|ab)*bc').match('ab' * 28 + 'ac')
```
In Python (tested on both 2.7 and 3.5), this match takes 91s, and doubles for each additional repeat of 'ab'.
Thus, many proponents advocate a purely NFA (nondeterministic finite automaton) based approach. Even so, backreferences and look-around do add richness to regexes, and they are commonly used in applications such as syntax highlighting for text editors. In particular, TextMate's syntax definitions, based on the Oniguruma backtracking engine, are now used in a number of other popular editors, including Sublime Text and Atom. These syntax definitions routinely use backreferences and look-around. For example, the following regex captures a single-line Rust raw string:
```
r(#*)".*?"\1
```
There is no NFA that can express this simple and useful pattern. Yet, a backtracking implementation handles it efficiently.
This package is one of the first that handles both cases well. The exponential blowup case above is run in 258ns. Thus, it should be a very appealing alternative for applications that require both richness and performance.
NFA-based approaches give strong guarantees about worst-case performance. For regexes that contain "fancy" features such as backreferences and look-around, this module gives no corresponding guarantee. If an attacker can control the regular expressions that will be matched against, they will be able to successfully mount a denial-of-service attack. Be warned.
See PERFORMANCE.md for some examples.
One workable approach is to detect the presence of "fancy" features, and choose either an NFA implementation or a backtracker depending on whether they are used.
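A toy version of such a coarse-grained check might look like the following. This is a hypothetical sketch, not this crate's actual analysis: it scans the pattern text rather than a parsed AST, so it only approximates real feature detection.

```rust
/// Hypothetical coarse check for "fancy" features by scanning the
/// pattern source. A real implementation would work on the parsed AST.
fn looks_fancy(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                // \1 .. \9 are backreferences.
                if i + 1 < bytes.len() && bytes[i + 1].is_ascii_digit() {
                    return true;
                }
                i += 2; // skip the escaped character
                continue;
            }
            b'(' => {
                // (?= (?! (?<= (?<! introduce look-around assertions.
                if pattern[i..].starts_with("(?=")
                    || pattern[i..].starts_with("(?!")
                    || pattern[i..].starts_with("(?<=")
                    || pattern[i..].starts_with("(?<!")
                {
                    return true;
                }
            }
            _ => {}
        }
        i += 1;
    }
    false
}

fn main() {
    assert!(!looks_fancy("(a|b|ab)*bc"));      // plain, NFA-friendly
    assert!(looks_fancy(r#"r(#*)".*?"\1"#));   // backreference
    assert!(looks_fancy("foo(?=bar)"));        // look-ahead
    println!("all checks passed");
}
```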
However, this module attempts to be more fine-grained. Instead, it implements a true hybrid approach. In essence, it is a backtracking VM (as well explained in Regular Expression Matching: the Virtual Machine Approach) in which one of the "instructions" in the VM delegates to an inner NFA implementation (in Rust, the regex crate, though a similar approach would certainly be possible using RE2 or the Go regexp package). Then there's an analysis which decides for each subexpression whether it is "hard", or can be delegated to the NFA matcher. At the moment, it is eager, and delegates as much as possible to the NFA engine.
(This section is written in a somewhat informal style; I hope to expand on it.)
The fundamental idea is that it's a backtracking VM like PCRE, but as much as possible it delegates to an "inner" RE engine like RE2 (in this case, the Rust one). For the sublanguage not using fancy features, the library becomes a thin wrapper.
Otherwise, you do an analysis to figure out what you can delegate and what you have to backtrack. I was thinking it might be tricky, but it's actually quite simple. In the first phase, you just label each subexpression as "hard" (groups that get referenced in a backref, look-around, etc.), and bubble that up. You also do a little extra analysis, mostly determining whether an expression has constant match length, and the minimum length.
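The first phase might be sketched like this (hypothetical `Expr` and `Info` types, not this crate's actual AST; minimum-length tracking is omitted for brevity):

```rust
/// Simplified expression tree (hypothetical; the real AST is richer).
enum Expr {
    Literal(String),
    Backref(usize),
    LookAhead(Box<Expr>),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
}

/// Result of the bottom-up analysis for one subexpression.
struct Info {
    hard: bool,               // uses a fancy feature somewhere inside
    const_len: Option<usize>, // Some(n) if every match has length n
}

fn analyze(e: &Expr) -> Info {
    match e {
        Expr::Literal(s) => Info { hard: false, const_len: Some(s.len()) },
        // Backreferences are "hard": they force backtracking.
        Expr::Backref(_) => Info { hard: true, const_len: None },
        Expr::LookAhead(inner) => {
            analyze(inner); // children still get analyzed
            // Look-around is hard, but consumes no input itself.
            Info { hard: true, const_len: Some(0) }
        }
        Expr::Concat(children) => {
            let infos: Vec<Info> = children.iter().map(analyze).collect();
            Info {
                // Hardness bubbles up from any child.
                hard: infos.iter().any(|i| i.hard),
                // Lengths add up if every child has a constant length.
                const_len: infos.iter()
                    .map(|i| i.const_len)
                    .sum::<Option<usize>>(),
            }
        }
        Expr::Alt(children) => {
            let infos: Vec<Info> = children.iter().map(analyze).collect();
            let first = infos[0].const_len;
            Info {
                hard: infos.iter().any(|i| i.hard),
                // Constant only if all branches share the same length.
                const_len: if infos.iter().all(|i| i.const_len == first) {
                    first
                } else {
                    None
                },
            }
        }
    }
}

fn main() {
    let easy = Expr::Concat(vec![
        Expr::Literal("ab".into()),
        Expr::Alt(vec![Expr::Literal("c".into()), Expr::Literal("d".into())]),
    ]);
    let info = analyze(&easy);
    assert!(!info.hard);
    assert_eq!(info.const_len, Some(3));

    let hard = Expr::Concat(vec![Expr::Literal("r".into()), Expr::Backref(1)]);
    assert!(analyze(&hard).hard);
    println!("analysis ok");
}
```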
The second phase is top-down, and you carry a context: a boolean indicating whether it's "hard" or not. Intuitively, a hard context is one in which the match length will affect future backtracking.
If the subexpression is easy and the context is easy, generate an instruction in the VM that delegates to the inner NFA implementation. Otherwise, generate VM code as in a backtracking engine. Most expression nodes are pretty straightforward; the only interesting case is concat (a sequence of subexpressions).
Even that one is not terribly complex. First, determine a prefix of easy nodes of constant match length (this won't affect backtracking, so it's safe to delegate to the NFA). Then, if your context is easy, determine a suffix of easy nodes. Both of these delegate to the NFA. For the ones in between, compile recursively. In an easy context, the last of these also gets an easy context; everything else is generated in a hard context. So, conceptually, hard context flows from right to left, and from parents to children.
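That splitting rule can be sketched as follows (hypothetical `Child` and `Plan` types; the real compiler works on the analyzed AST rather than precomputed flags):

```rust
/// Per-child summary (hypothetical): is the child easy to delegate,
/// and does it have constant match length?
#[derive(Clone, Copy)]
struct Child {
    easy: bool,
    const_len: bool,
}

/// What the compiler would emit for each child of a concat node.
#[derive(Debug, PartialEq)]
enum Plan {
    Delegate,      // a single "call the inner NFA engine" instruction
    Compile(bool), // recursive compilation; bool = easy context for the child
}

/// Split a concat node's children as described above: a delegated
/// prefix of easy constant-length children, a delegated suffix of easy
/// children when the context is easy, and recursive compilation in
/// between, with hard context flowing from right to left.
fn plan_concat(children: &[Child], easy_context: bool) -> Vec<Plan> {
    let prefix = children.iter()
        .take_while(|c| c.easy && c.const_len)
        .count();
    let suffix = if easy_context {
        children[prefix..].iter()
            .rev()
            .take_while(|c| c.easy)
            .count()
    } else {
        0
    };
    let mid_end = children.len() - suffix;
    (0..children.len()).map(|i| {
        if i < prefix || i >= mid_end {
            Plan::Delegate
        } else {
            // Only the last middle child inherits an easy context.
            Plan::Compile(easy_context && i == mid_end - 1)
        }
    }).collect()
}

fn main() {
    let konst = Child { easy: true, const_len: true };
    let var = Child { easy: true, const_len: false };
    let hard = Child { easy: false, const_len: false };

    // [constant, hard, variable-length] in an easy context: the
    // constant prefix and the easy suffix delegate to the NFA; the
    // hard child in between is compiled, and as the last middle
    // child in an easy context it gets an easy context itself.
    let plan = plan_concat(&[konst, hard, var], true);
    assert_eq!(
        plan,
        vec![Plan::Delegate, Plan::Compile(true), Plan::Delegate]
    );
    println!("plan ok");
}
```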
Still in development, though the basic ideas are in place. Currently, the following features are missing:
- Procedure calls and recursive expressions
Many thanks to Andrew Gallant for stimulating conversations that inspired this approach, as well as for creating the excellent regex crate.
The main author is Raph Levien, with many contributions from Robin Stocker.
We gladly accept contributions via GitHub pull requests. Please see CONTRIBUTING.md for more details.
This project started out as a Google 20% project, but none of the authors currently work at Google, so it has been forked to be community-maintained.