- Notifications
You must be signed in to change notification settings - Fork159
linear time regular expression matching in Java
License
google/re2j
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
RE2 is a regular expression engine that runs in time linear in the size of theinput. RE2/J is a port of C++ libraryRE2 to pure Java.
Java's standard regular expression package,java.util.regex
, and many otherwidely used regular expression packages such as PCRE, Perl and Python use abacktracking implementation strategy: when a pattern presents two alternativessuch asa|b
, the engine will try to match subpatterna
first, and if thatyields no match, it will reset the input stream and try to matchb
instead.
If such choices are deeply nested, this strategy requires an exponential numberof passes over the input data before it can detect whether the input matches.If the input is large, it is easy to construct a pattern whose running timewould exceed the lifetime of the universe. This creates a security risk whenaccepting regular expression patterns from untrusted sources, such as users ofa web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a singlepass over the input data by using anondeterministic finite automaton.
There are certain features of PCRE or Perl regular expressions that cannot beimplemented in linear time, for example, backreferences, but the vast majorityof regular expressions patterns in practice avoid such features.
If you use regular expression patterns with a high degree of alternation, yourcode may run faster with RE2/J. In the worst case, thejava.util.regex
matcher may run forever, or exceed the available stack space and fail; thiswill never happen with RE2/J.
This is not an official Google product (experimental or otherwise), it is justcode that happens to be owned by Google.
RE2/J is not a drop-in replacement forjava.util.regex
. Aside from thedifferent package name, it doesn't support the following parts of theinterface:
- the MatchResult class
- Matcher.hasAnchoringBounds()
- Matcher.hasTransparentBounds()
- Matcher.hitEnd()
- Matcher.region(int, int)
- Matcher.regionEnd()
- Matcher.regionStart()
- Matcher.requireEnd()
- Matcher.toMatchResult()
- Matcher.useAnchoringBounds(boolean)
- Matcher.usePattern(Pattern)
- Matcher.useTransparentBounds(boolean)
- CANON_EQ
- COMMENTS
- LITERAL
- UNICODE_CASE
- UNICODE_CHARACTER_CLASS
- UNIX_LINES
- PatternSyntaxException.getMessage()
It also doesn't have parity with the full set of Java's character classes andspecial regular expression constructs.
If you're using Maven, you can use the following snippet in yourpom.xml
to get RE2/J:
<dependency> <groupId>com.google.re2j</groupId> <artifactId>re2j</artifactId> <version>1.6</version></dependency>
You can use the same artifact details in any build system compatible with the Maven Central repositories (e.g. Gradle, Ivy).
You can also download RE2/J the old-fashioned way: go tothe RE2/J release tag, download the RE2/J JAR and add it to your CLASSPATH.
We have set up a Google Group for discussion, please join theRE2/J discussionlist if you'd like to get intouch.
If you would like to contribute patches, please see theinstructions forcontributors.
RE2 was designed and implemented in C++ by Russ Cox. The C++ implementationincludes both NFA and DFA engines and numerous optimisations. Russ also porteda simplified version of the NFA to Go. Alan Donovan ported the NFA-based Goimplementation to Java. Afroz Mohiuddin wrapped the engine in a familiar JavaMatcher
/Pattern
API. James Ring prepared the open-source releaseand has been its primary maintainer since then.
About
linear time regular expression matching in Java