A Python module for generating performant regex patterns in a readable and maintainable way


Inspiaaa/ReBuild


Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

With ReBuild you can write performant regex patterns in a readable and maintainable way by using functions. It is inspired by the Simple Regex Language (SRL).

Why ReBuild?

  • Helps you write maintainable regex patterns that are readable, regardless of how good you are at regex

  • ReBuild creates highly optimised regex patterns by analysing the input patterns

  • It uses non-capturing groups as often as possible to enhance performance

  • Fully string-based: it is versatile and can be used with many different tools

  • Lets you use your beloved autocomplete for generating regex patterns

  • Helps you discover new ways of writing regex patterns that involve more code reuse

Optimisation

ReBuild does a lot of processing ahead of time (AOT) to generate performant and simple regex patterns. Many of its functions can optimise specific cases, which means that they act as zero-cost abstractions.

A good example of this optimisation is the either function, which represents a regex alternation (OR), e.g. a|b|c.

    # Matches the string abc, 123, or xyz
    either("abc", "123", "xyz")
    >>> '(?:abc|123|xyz)'

    # Matches any letter from a to z, any digit or the string "xyz"
    either("[a-z]", "[0-9]", "xyz")
    >>> '(?:[a-z0-9]|xyz)'
    # ReBuild detects the two character sets and combines them

    # Matches any letter from a to z or any digit
    either("[a-z]", "[0-9]")
    >>> '[a-z0-9]'
    # It detects that an OR is no longer necessary after
    # combining the character sets, and optimises it away

    # Matches the character "a", "b" or "c"
    either("a", "b", "c")
    >>> '[abc]'
    # ReBuild transforms the OR of individual characters into a single character set
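To see that these rewrites preserve behaviour, the following sketch compares each optimised pattern with a naive alternation using only Python's standard re module. The unoptimised patterns and the sample strings are assumptions chosen for illustration; they are not produced by ReBuild itself.

```python
import re

# Pairs of (naive alternation, optimised pattern). The optimised patterns
# come from the examples above; the naive forms are assumed for comparison.
PATTERN_PAIRS = [
    ("(?:abc|123|xyz)", "(?:abc|123|xyz)"),
    ("(?:[a-z]|[0-9]|xyz)", "(?:[a-z0-9]|xyz)"),
    ("(?:[a-z]|[0-9])", "[a-z0-9]"),
    ("(?:a|b|c)", "[abc]"),
]

# A handful of probe strings, some of which match and some of which do not.
SAMPLES = ["a", "b", "c", "z", "7", "abc", "123", "xyz", "xy", "A", "", "ab"]


def same_matches(unoptimised: str, optimised: str) -> bool:
    """Return True if both patterns fully match exactly the same samples."""
    return all(
        bool(re.fullmatch(unoptimised, s)) == bool(re.fullmatch(optimised, s))
        for s in SAMPLES
    )


results = [same_matches(u, o) for u, o in PATTERN_PAIRS]
print(results)
```

A check over a fixed sample list is not a proof of equivalence, but it is a cheap sanity test when experimenting with pattern rewrites.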

How it works

For every function (like either, one_of, ...) ReBuild performs the following steps:

  1. Generate a non-optimised regex pattern (rebuild.builder)

  2. Parse this pattern and convert it to a tree structure (rebuild.parser)

  3. Analyse and optimise the tree (rebuild.analyser)

  4. Convert the tree back to a string regex pattern

As you can see from the aforementioned steps, ReBuild consists of 3 modules: builder, parser and analyser.

  • rebuild.builder provides the regex building functions like either, one_of, ..., which first create a non-optimised pattern and then run it through an intermediate optimisation process

    print(either("a", "b", one_of("0-9")))
    # Before optimising (intermediate optimisations disabled for this example)
    >>> '(?:(?:a)|(?:b)|(?:[0-9]))'

  • rebuild.parser parses string regex patterns with Lark, the amazing parsing library for Python.

    First, it converts the regex pattern into a concrete syntax tree (CST)

    # CST of '(?:(?:a)|(?:b)|(?:[0-9]))'
    alternation
      single_char  a
      single_char  b
      char_set
        range  0-9

    Then this tree is converted into another tree (the AST), now consisting of RegexNodes, which are defined in the rebuild.analyser module. Amongst other things, these nodes contain more information about the pattern.

    Alternation
    | SingleChar: "a"
    | SingleChar: "b"
    | CharSet
    | | IsInverted: False
    | | Options
    | | | Range
    | | | | FromChar: "0"
    | | | | ToChar: "9"

  • rebuild.analyser defines the nodes of this tree, each of which has optimisation rules and a method for converting back to a string regex pattern.

    # After optimisation
    CharSet
    | IsInverted: False
    | Options
    | | SingleChar: "a"
    | | SingleChar: "b"
    | | Range
    | | | FromChar: "0"
    | | | ToChar: "9"

    Converted back to a string:

    '[ab0-9]'
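The tree rewrite shown above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not ReBuild's actual implementation: the class names mirror the node names printed above, but the fields, the optimise method and the to_regex method are hypothetical, and character escaping is ignored.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class SingleChar:
    char: str

    def to_regex(self) -> str:
        return self.char  # toy model: no escaping of special characters


@dataclass
class Range:
    from_char: str
    to_char: str

    def to_regex(self) -> str:
        return f"{self.from_char}-{self.to_char}"


@dataclass
class CharSet:
    options: List[Union[SingleChar, Range]]
    is_inverted: bool = False

    def to_regex(self) -> str:
        prefix = "^" if self.is_inverted else ""
        return "[" + prefix + "".join(o.to_regex() for o in self.options) + "]"


@dataclass
class Alternation:
    options: list

    def to_regex(self) -> str:
        return "(?:" + "|".join(o.to_regex() for o in self.options) + ")"

    def optimise(self):
        """Fold single characters and non-inverted sets into one CharSet."""
        merged, others = [], []
        for option in self.options:
            if isinstance(option, SingleChar):
                merged.append(option)
            elif isinstance(option, CharSet) and not option.is_inverted:
                merged.extend(option.options)
            else:
                others.append(option)
        if not merged:
            return self
        combined = CharSet(merged)
        # If nothing else is left, the alternation itself is unnecessary.
        return combined if not others else Alternation([combined] + others)


tree = Alternation([SingleChar("a"), SingleChar("b"), CharSet([Range("0", "9")])])
print(tree.to_regex())             # (?:a|b|[0-9])
print(tree.optimise().to_regex())  # [ab0-9]
```

Keeping the optimisation rule on the node that owns it means each node type only needs to know how to simplify itself and how to print itself.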

As you can see, ReBuild re-optimises the input string for every function, maximising readability and ease of use.

Should you be worried about the performance penalty of re-optimising for every single function?

No. ReBuild uses Lark's most optimised mode, a tree-less LALR(1) parser. Lark is one of the fastest parsing libraries available for Python. And after all, you should generate the regex pattern only once, store it in a constant and use that constant in the rest of your program.
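The "generate once, store in a constant" advice might look like the sketch below. It uses only the standard re module; the literal "[abc]" stands in for the string a builder call such as either("a", "b", "c") returns, and the names ABC_PATTERN, ABC_REGEX and count_abc are hypothetical.

```python
import re

# Assumption: this string is what either("a", "b", "c") returns, as in the
# example earlier in this README. It is written out literally here so the
# snippet runs without ReBuild installed.
ABC_PATTERN = "[abc]"

# Compile once at import time and reuse the compiled object everywhere.
ABC_REGEX = re.compile(ABC_PATTERN)


def count_abc(text: str) -> int:
    """Count how many characters of text are 'a', 'b' or 'c'."""
    return len(ABC_REGEX.findall(text))


print(count_abc("a big cab"))
```

With this layout the cost of building and optimising the pattern is paid once per process, no matter how often count_abc is called.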

If you would rather optimise only once, disable intermediate optimisation and call optimise on the final pattern:

    import rebuild.builder
    rebuild.builder.INTERMEDIATE_OPTIMISATION = False

    from rebuild.builder import *

    print(either("a", "b", "c"))            # (?:(?:a)|(?:b)|(?:c))
    print(optimise(either("a", "b", "c")))  # [abc]
