A Python module for generating performant regex patterns in a readable and maintainable way
Inspiaaa/ReBuild
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski
With ReBuild you can write performant regex patterns in a readable and maintainable way by using functions. It is inspired by the Simple Regex Language (SRL).
- Helps you write maintainable regex patterns that are readable, regardless of how good you are at regex
- ReBuild creates highly optimised regex patterns by analysing the input patterns
- It uses non-capturing groups as often as possible to enhance performance
- Fully string based: it's versatile and can be used with many different tools
- Lets you use your beloved autocomplete for generating regex patterns
- Helps you discover new ways of writing regex patterns that involve more code reuse
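The point about non-capturing groups can be illustrated with Python's standard `re` module alone. The following is a small sketch that does not use ReBuild; the two patterns are made up for illustration. It only shows that a non-capturing group matches the same input while recording fewer groups; how much time this saves depends on the regex engine and the pattern.

```python
import re

# The same alternation, once as a capturing and once as a non-capturing group.
# Both patterns are illustrative and were not generated by ReBuild.
capturing = re.compile(r"(abc|123|xyz)-(\d+)")
non_capturing = re.compile(r"(?:abc|123|xyz)-(\d+)")

# The capturing version has to keep track of two groups, the other only one.
print(capturing.groups)      # 2
print(non_capturing.groups)  # 1

# Both accept the same input; only the recorded groups differ.
m1 = capturing.fullmatch("xyz-42")
m2 = non_capturing.fullmatch("xyz-42")
print(m1.groups())  # ('xyz', '42')
print(m2.groups())  # ('42',)
```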
ReBuild does a lot of processing ahead of time (AOT) to generate performant and simple regex patterns. Many of its functions can optimise specific cases, which means that many functions provide zero-cost abstraction.
A good example to illustrate the optimisation power is the `either` function, which represents a regex OR, e.g. `a|b|c`.
```python
# Matches the string abc, 123, or xyz
either("abc", "123", "xyz")
>>> '(?:abc|123|xyz)'

# Matches any letter from a to z, any digit or the string "xyz"
either("[a-z]", "[0-9]", "xyz")
>>> '(?:[a-z0-9]|xyz)'
# ReBuild detects the two character sets and combines them

# Matches any letter from a to z or any digit
either("[a-z]", "[0-9]")
>>> '[a-z0-9]'
# It detects that an OR is no longer necessary after
# combining the character sets, and optimises it away

# Matches the character "a", "b" or "c"
either("a", "b", "c")
>>> '[abc]'
# ReBuild transforms the OR of individual characters into a single character set
```
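One way to convince yourself that such rewrites do not change what is matched is to compare the optimised outputs against hand-written naive alternations using the standard `re` module. This sketch does not call ReBuild: the optimised strings are copied from the outputs listed above, while the naive forms, the sample inputs, and the helper `same_matches` are assumptions made for illustration. It is a spot check on a few samples, not a proof of equivalence.

```python
import re

# (naive alternation, optimised output) pairs.
# The naive forms are hand-written assumptions, not ReBuild output.
PAIRS = [
    (r"(?:[a-z]|[0-9]|xyz)", r"(?:[a-z0-9]|xyz)"),
    (r"(?:[a-z]|[0-9])", r"[a-z0-9]"),
    (r"(?:a|b|c)", r"[abc]"),
]

# A handful of inputs, including ones that should be rejected.
SAMPLES = ["a", "b", "c", "q", "7", "xyz", "xy", "A", "", "ab", "-"]


def same_matches(naive: str, optimised: str, samples: list) -> bool:
    """Return True if both patterns accept and reject the same samples."""
    return all(
        bool(re.fullmatch(naive, s)) == bool(re.fullmatch(optimised, s))
        for s in samples
    )


for naive, optimised in PAIRS:
    print(naive, "->", optimised, same_matches(naive, optimised, SAMPLES))
```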
For every function (like `either`, `one_of`, ...) ReBuild does the following steps:

1. Generate a non-optimised regex pattern (`rebuild.builder`)
2. Parse this pattern and convert it to a tree structure (`rebuild.parser`)
3. Analyse and optimise the tree (`rebuild.analyser`)
4. Convert the tree back to a string regex pattern
As you can see from the aforementioned steps, ReBuild consists of 3 modules: `builder`, `parser` and `analyser`.
`rebuild.builder` provides the regex building functions like `either`, `one_of`, ... which first create a non-optimised pattern and then run it through an intermediate optimisation process.

```python
print(either("a", "b", one_of("0-9")))
# Before optimising (intermediate optimisations disabled for this example)
>>> '(?:(?:a)|(?:b)|(?:[0-9]))'
```
`rebuild.parser` parses string regex patterns with Lark, the amazing parsing library for Python. First, it converts the regex pattern into a concrete syntax tree (CST).

```
# CST of '(?:(?:a)|(?:b)|(?:[0-9]))'
alternation
  single_char   a
  single_char   b
  char_set
    range       0-9
```
Then this tree is converted into another tree (AST), now consisting of `RegexNode`s, which are defined in the `rebuild.analyser` module. Amongst other things, these nodes contain more information about the pattern.

```
Alternation
| SingleChar: "a"
| SingleChar: "b"
| CharSet
| | IsInverted: False
| | Options
| | | Range
| | | | FromChar: "0"
| | | | ToChar: "9"
```
`rebuild.analyser` defines the nodes of this tree, which each have rules for optimisations and methods for converting back to a string regex pattern.

```
# After optimisation
CharSet
| IsInverted: False
| Options
| | SingleChar: "a"
| | SingleChar: "b"
| | Range
| | | FromChar: "0"
| | | ToChar: "9"
```
Converted back to a string:
'[ab0-9]'
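To make the tree step more concrete, here is a minimal sketch of how nodes with an optimisation rule and a back-to-string method could look. It is not ReBuild's actual code: the class names mirror the node names printed above, but the fields, the `optimise` and `to_regex` methods, and the single merge rule are assumptions for illustration. Escaping of special characters is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SingleChar:
    char: str

    def to_regex(self) -> str:
        # Assumption: plain alphanumeric characters only, so no escaping.
        return self.char


@dataclass
class Range:
    from_char: str
    to_char: str

    def to_regex(self) -> str:
        return self.from_char + "-" + self.to_char


@dataclass
class CharSet:
    options: List[Union[SingleChar, Range]]
    is_inverted: bool = False

    def to_regex(self) -> str:
        prefix = "^" if self.is_inverted else ""
        body = "".join(option.to_regex() for option in self.options)
        return "[" + prefix + body + "]"


@dataclass
class Alternation:
    options: list

    def optimise(self):
        """Hypothetical rule: merge single chars and non-inverted char sets."""
        if not self.options:
            return self
        merged = []
        for option in self.options:
            if isinstance(option, SingleChar):
                merged.append(option)
            elif isinstance(option, CharSet) and not option.is_inverted:
                merged.extend(option.options)
            else:
                return self  # something this rule cannot merge; leave as is
        return CharSet(merged)

    def to_regex(self) -> str:
        return "(?:" + "|".join(o.to_regex() for o in self.options) + ")"


tree = Alternation([SingleChar("a"), SingleChar("b"), CharSet([Range("0", "9")])])
print(tree.to_regex())             # (?:a|b|[0-9])
print(tree.optimise().to_regex())  # [ab0-9]
```

Keeping the rule on the node itself means each node type only needs to know which of its own children it can simplify, which is one plausible reading of "nodes ... each have rules for optimisations".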
As you can see, ReBuild re-optimises the input string for every function, maximising readability and ease of use.
Should you be worried about the performance penalty of re-optimising for every single function?
No. ReBuild uses Lark's most optimised mode, a tree-less LALR(1) parser, and Lark is one of the fastest parsing libraries available for Python. In any case, you should generate the regex pattern only once, store it in a constant, and use that constant for the rest of your program.
But if you want to only optimise once, you can do the following:
```python
import rebuild.builder
rebuild.builder.INTERMEDIATE_OPTIMISATION = False

from rebuild.builder import *

print(either("a", "b", "c"))            # (?:(?:a)|(?:b)|(?:c))
print(optimise(either("a", "b", "c")))  # [abc]
```
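The advice to generate the pattern once and keep it in a constant can be sketched as follows. To stay runnable without ReBuild installed, the pattern string is copied from the `either("[a-z]", "[0-9]", "xyz")` output shown earlier rather than generated; the names `TOKEN_PATTERN`, `TOKEN_RE` and `is_token` are hypothetical.

```python
import re

# Pattern string copied from the README's example output for
# either("[a-z]", "[0-9]", "xyz"); in a real program it would be built once here.
TOKEN_PATTERN = "(?:[a-z0-9]|xyz)"

# Compile once at import time and reuse the compiled object everywhere.
TOKEN_RE = re.compile(TOKEN_PATTERN)


def is_token(text: str) -> bool:
    """Return True if the whole text matches the stored pattern."""
    return TOKEN_RE.fullmatch(text) is not None


print(is_token("xyz"))  # True
print(is_token("7"))    # True
print(is_token("xy"))   # False
```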