- Notifications
You must be signed in to change notification settings - Fork5
kyx0r/pikevm
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
What is pikevm?==============re1 (http://code.google.com/p/re1/) is "toy regular expression implementation"by Russel Cox, featuring simplicity and minimal code size unheard of in otherimplementations. re2 (http://code.google.com/p/re2/) is "an efficient,principled regular expression library" by the same author. It is robust,full-featured, and ... bloated, comparing to re1.This is implementation of pikevm based on re1.5 which adds features required forminimalistic real-world use, while sticking to the minimal code size andmemory use.https://github.com/pfalcon/re1.5Why?====Pikevm guarantees that any input regex will scale O(n) with the size of thestring, thus making it the fastest regex implementation. There is no backtrackingthat usually expodes to O(n^k) time and space where k is some constant.Features========* UnLike re1.5, here is only pikevm, one file easy to use.* Unlike re1.5, regexes is compiled to type sized code rather than bytecode,eliviating the problem of byte overflow in splits/jmps on large regexes.Currently the type used is int, and every atom in compiled code is alignedto that.* Matcher does not take size of string as param, it checks for '\0' instead,so that the user does not need to waste time taking strlen()* Highly optimized source code, probably 2x faster than re1.5* Support for quoted chars in regex. Escapes in brackets.* Support for ^, $ assertions in regex.* Support for repetition operator {n} and {n,m} and {n,}.* Support for Unicode (UTF-8).* Unlike other engines, the output is byte level offset. (Which is more useful)* Support for non capture group ?:* Support for wordend & wordbeg assertions- Some limitations for word assertions are meta chars like spaces being usedin for expression itself, for example "\< abc" should match " abc" exactly atthat space word boundary but it won't. It's possible to fix this, but it wouldrequire rsplit before word assert, and some dirty logic to check that the characteror class is a space we want to match not assert at. But the code for it was toodirty and I scrapped it. Syntax for word assertions are like posix C library, notthe pcre "\b" which can be used both in front or back of the word, because there isno distinction, it makes the implementation potentially even uglier.* Assert flags like REG_ICASE,REG_NOTEOL,REG_NOTBOL and lookaround assertionsare implemented in nextvi branch or Nextvi's regex.chttps://github.com/kyx0r/nextvi/blob/master/regex.cNOTES=====The problem described in this paper has been fixed. Ambiguous matching is correct.HISTORY:https://re2c.org/2019_borsotti_trofimovich_efficient_posix_submatch_extraction_on_nfa.pdf"Cox, 2009 (incorrect). Cox came up with the idea of backward POSIX matching,which is based on the observation that reversing the longest-match rulesimplifies the handling of iteration subexpressions: instead of maximizingsubmatch from the first to the last iteration, one needs to maximize theiterations in reverse order. This means that the disambiguation is alwaysbased on the most recent iteration, removing the need to remember all previousiterations (except for the backwards-first, i.e. the last one, which containssubmatch result). The algorithm tracks two pairs of offsets per each submatchgroup: the active pair (used for disambiguation) and the result pair. It givesincorrect results under two conditions: (1) ambiguous matches have equaloffsets on some iteration, and (2) disambiguation happens too late, whenthe active offsets have already been updated and the difference betweenambiguous matches is erased. We found that such situations may occur for tworeasons. First, the ε-closure algorithm may compare ambiguous paths aftertheir join point, when both paths have a common suffix with taggedtransitions. This is the case with the Cox prototype implementation; forexample, it gives incorrect results for (aa|a)* and string aaaaa. Most of suchfailures can be repaired by exploring states in topological order, but atopological order does not exist in the presence of ε-loops. The second reasonis bounded repetition: ambiguous paths may not have an intermediate join pointat all. For example, in the case of (aaaa|aaa|a){3,4} and string aaaaaaaaaa wehave matches (aaaa)(aaaa)(a)(a) and (aaaa)(aaa)(aaa) with a different numberof iterations. Assuming that the bounded repetition is unrolled by chainingthree sub-automata for (aaaa|aaa|a) and an optional fourth one, by the timeambiguous paths meet both have active offsets (0,4). Despite the flaw, Coxalgorithm is interesting: if somehow the delayed comparison problem was fixed,it would work. The algorithm requires O(mt) memory and O(nm^2t) time(assuming a worst-case optimal closure algorithm), where n is thelength of input, m it the size of RE and t is the number of submatch groupsand subexpressions that contain them."Research has shown that it is possible to disambiguate NFA in polynomial timebut it brings serious performance issues on non ambiguous inputs. See thebranch "disambiguate_paths" on this repo shows what is being done to solve itand the potential performance costs. In short it requires tracking the parentof every state added on nlist from clist. If the state from nlist matchesthe consumer, the alternative clist state related to that nlist state getsdiscarded and the nsub ref can be decremented (freed). The reason why thisproblem does not exist for non ambiguous regexes is because the alternativeclist state will never match due to the next state having a different consumer. There is no need for any extra handling it gets freed normally. I decidedto not apply this solution here because I think most use cases for regex arenot ambiguious like say regex: "a{10000}". If you try matching 10000 'a'characters in a row like that you will have a problem where the stack usagewill jump up to 10000*(subsize) but it will never exceed the size of regexthough, but the number of NFA states will also increase by the same amount,so at the charater 9999 you will find 9999 redundant nlist states, that willdegrade performance linearly, however it will be very slow compared touplimited regex like a+. The cost of this solution is somewhere around 2%general performance decrease (broadly), but a magnitude of complexitydecrease for ambiguous cases, for example matching 64 characters went downfrom 30 to 9 microseconds. Another solution to this problem can be todetermine the ambiguous paths at compile time and flag the inner states asambiguous ahead of time, still this can't avoid having a loop though the altstates as their positioning in clist can't be precomputed due to the dynamicchanges.(Comment about O(mt) memory complexity)This worst case scenario can only happen on ambiguous input. Ambiguousconsumers (char, class, any) assuming t is 1. In practice there is almostnever a situation where someone wants to search using regex this large. Mostof the time memory usage is very low and the space complexity for nonambigious regex is O(nt) where n is the number of currently consideringalternate paths in the regex and t is the number of submatch groups.This pikevm implementation features an improved submatch extraction algorithmbased on Russ Cox's original design. I - Kyryl Melekhin have found a way tooptimize the tracking properly of 1st number in the submatch pair. Based onsimple observation of how the NFA is constructed I noticed that there is noway for addthread1() to ever reach inner SAVE instructions in the regex, sothat leaves tracking 2nd pairs by addthread1() irrelevant to the finalresults (except the need to initialize the sub after allocation). Thisimproved the overall performance by 25% which is massive considering that atthe time there was nothing else left to can be done to make it faster.What are on##list macros?Redundant state inside nlist can happen in couple of ways, and has to do with the (closure) a* (star) operations and also +. Due to the automata machine design split happens to be above the next consumed instruction and if that state gets added onto the list we may segfault or give wrong submatch result. Rsplit does not have this problem because it is generated below the consumer instruction, but it can still add redundant states. Overall this is extremely difficult to understand or explain, but this is just something we have to check for. We checked for this using extra int inside the split instructions, so this left some global state inside the machine insts. Most of the time we just added to the next gen number and kept incrementing it forever. This leaves a small chance of overflowing the int and getting a run on a false state left from previous use of the regex. Though if overflow never happens there is no chance of getting a false state. Overflows like this pose a high security threat, if the hacker knows how many cycles he needs to overflow the gen variable and get inconsistent result. It is possible to reset the marks if we near the overflow, but as you may guess that does not come for free.Currently I removed all dynamic global state from the instructions fixing any overlow issue utilizing a sparse set datastructure trick which abuses the uninitialized varibles. This allows the redundant states to be excluded inO(1) operation. That said, don't run valgrind on pikevm as it will go crazy, or find a way to surpress errors from pikevm.Further reading===============https://research.swtch.com/sparsehttps://swtch.com/~rsc/regexp/regexp1.htmlAuthor and License==================licensed under BSD license, just as the original re1.
About
Russ Cox/Rob Pike pikevm regex implementation
Topics
Resources
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
No releases published
Packages0
No packages published