re2: R interface to Google RE2

Overview

The re2 package provides pattern matching, extraction, replacement, and other string-processing operations using Google’s RE2 (C++) regular-expression library. The interface is consistent and similar to stringr.

Why re2?

Regular expression matching can be done in two ways: using recursive backtracking or using finite automata-based techniques.

Perl, PCRE, Python, Ruby, Java, and many other languages rely on recursive backtracking for their regular expression implementations. The problem with this approach is that performance can degrade very quickly: in the worst case, time complexity is exponential in the length of the input. In contrast, re2 uses finite automata-based techniques for regular expression matching, guaranteeing linear-time execution and a fixed stack footprint. See links to Russ Cox’s excellent articles below.
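
As a rough illustration of the linear-time claim, the sketch below runs re2_detect() on a nested-quantifier pattern of the kind often cited as a worst case for backtracking engines. The pattern and the input string are illustrative choices, not taken from the package documentation, and timings vary by machine, so the elapsed time is printed rather than quoted.

```r
library(re2)

# Illustrative input: 30 "a" characters followed by a "b", so the
# anchored pattern below cannot match the whole string.
x <- paste0(strrep("a", 30), "b")

# Nested quantifiers such as (a+)+ are a classic worst case for
# backtracking matchers; an automata-based engine scans the input once.
pattern <- "^(a+)+$"

matched <- re2_detect(x, pattern)
matched
#> [1] FALSE

# Timing is machine-dependent, so it is printed rather than asserted.
print(system.time(re2_detect(x, pattern)))
```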

Installation

    # Install the released version from CRAN:
    install.packages("re2")

    # Install the development version from GitHub:
    # install.packages("devtools")
    devtools::install_github("girishji/re2")

Usage

re2 provides three types of regular-expression functions:

Find the presence of a pattern in string
Extract substrings that match a pattern
Replace matched groups

All functions take a vector of strings as argument. Regular-expression patterns can be compiled and reused for performance.
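
A minimal sketch of compiling a pattern once and reusing it across calls. The input vector and variable names are made up for illustration; only re2_regexp(), re2_detect() and re2_count() come from the package.

```r
library(re2)

x <- c("yabba dabba doo", "scooby", "fred")

# Compile once; the compiled object is passed wherever a pattern is expected.
re <- re2_regexp("b+")

has_b  <- re2_detect(x, re)  # is there a run of "b" in each string?
n_runs <- re2_count(x, re)   # how many runs of "b" per string?

has_b
#> [1]  TRUE  TRUE FALSE
n_runs
#> [1] 2 1 0
```

Compiling up front is most likely to pay off when the same pattern is applied many times, for example inside a loop or across several of the functions below.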

Here are the primary verbs of re2:

re2_detect(x, pattern) finds if a pattern is present in string

    re2_detect(c("barbazbla", "foobar", "foxy brown"), "(foo)|(bar)baz")
    #> [1]  TRUE  TRUE FALSE

re2_count(x, pattern) counts the number of matches in string

    re2_count(c("yellowgreen", "steelblue", "maroon"), "e")
    #> [1] 3 3 0

re2_subset(x, pattern) selects strings that match

    re2_subset(c("yellowgreen", "steelblue", "goldenrod"), "ee")
    #> [1] "yellowgreen" "steelblue"

re2_match(x, pattern, simplify = FALSE) extracts first matched substring

    re2_match("ruby:1234 68 red:92 blue:", "(\\w+):(\\d+)")
    #>      .0          .1     .2
    #> [1,] "ruby:1234" "ruby" "1234"

    # Groups can be named:
    re2_match(c("barbazbla", "foobar"), "(foo)|(?P<TestGroup>bar)baz")
    #>      .0       .1    TestGroup
    #> [1,] "barbaz" NA    "bar"
    #> [2,] "foo"    "foo" NA

    # Use pre-compiled regular expression:
    re <- re2_regexp("(foo)|(bar)baz", case_sensitive = FALSE)
    re2_match(c("BaRbazbla", "Foobar"), re)
    #>      .0       .1    .2
    #> [1,] "BaRbaz" NA    "BaR"
    #> [2,] "Foo"    "Foo" NA

re2_match_all(x, pattern) extracts all matched substrings

    re2_match_all("ruby:1234 68 red:92 blue:", "(\\w+):(\\d+)")
    #> [[1]]
    #>      .0          .1     .2
    #> [1,] "ruby:1234" "ruby" "1234"
    #> [2,] "red:92"    "red"  "92"

re2_replace(x, pattern, rewrite) replaces first matched pattern in string

    re2_replace("yabba dabba doo", "b+", "d")
    #> [1] "yada dabba doo"

    # Use groups in rewrite:
    re2_replace("bunny@wunnies.pl", "(.*)@([^.]*)", "\\2!\\1")
    #> [1] "wunnies!bunny.pl"

re2_replace_all(x, pattern, rewrite) replaces all matched patterns in string, or performs multiple replacements on each element of string.

    re2_replace_all("yabba dabba doo", "b+", "d")
    #> [1] "yada dada doo"

    # Multiple replacements given as a named vector:
    re2_replace_all(c("one", "two"), c("one" = "1", "1" = "2", "two" = "2"))
    #> [1] "2" "2"

re2_extract_replace(x, pattern, rewrite) extracts and substitutes (ignores non-matching portions of x)

    re2_extract_replace("bunny@wunnies.pl", "(.*)@([^.]*)", "\\2!\\1")
    #> [1] "wunnies!bunny"

re2_split(x, pattern, simplify = FALSE, n = Inf) splits string based on pattern

    re2_split("How vexingly quick daft zebras jump!", " quick | zebras")
    #> [[1]]
    #> [1] "How vexingly" "daft"         " jump!"

re2_locate(x, pattern) seeks the start and end of pattern in string

    re2_locate(c("yellowgreen", "steelblue"), "l(b)?l")
    #>      begin end
    #> [1,]     3   4
    #> [2,]     5   7

re2_locate_all(x, pattern) locates start and end of all occurrences of pattern in string

    re2_locate_all(c("yellowgreen", "steelblue"), "l")
    #> [[1]]
    #>      begin end
    #> [1,]     3   3
    #> [2,]     4   4
    #>
    #> [[2]]
    #>      begin end
    #> [1,]     5   5
    #> [2,]     7   7

In all the above functions, the regular-expression pattern argument is vectorized.
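
A small sketch of passing more than one pattern. It assumes that patterns are paired element-wise with the input strings, as in stringr; that pairing is an assumption here, so check help(re2_detect) for the actual recycling rules. No expected output is shown for that reason.

```r
library(re2)

fruits <- c("apple", "banana")

# One pattern per input string (element-wise pairing is assumed, not confirmed).
patterns <- c("^a", "^c")

res <- re2_detect(fruits, patterns)
res
```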

Regular-expression patterns can be compiled using re2_regexp(pattern, ...). Here are some of the options:

case_sensitive: Match is case-sensitive
encoding: UTF8 or Latin1
literal: Interpret pattern as literal, not regexp
longest_match: Search for longest match, not first match
posix_syntax: Restrict regexps to POSIX egrep syntax

help(re2_regexp) lists available options.

re2_get_options(regexp_ptr) returns a list of options stored in the compiled regular-expression object.
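
To make the options concrete, here is a minimal sketch using literal, one of the options listed above. The test strings are made up for illustration, and the contents of the list returned by re2_get_options() are printed with str() rather than assumed.

```r
library(re2)

# literal = TRUE: "." is an ordinary character, not a wildcard.
re_literal <- re2_regexp("a.b", literal = TRUE)
lit <- re2_detect(c("a.b", "axb"), re_literal)
lit
#> [1]  TRUE FALSE

# Without the option, "." matches any character.
plain <- re2_detect(c("a.b", "axb"), "a.b")
plain
#> [1] TRUE TRUE

# Inspect the options stored in the compiled object.
opts <- re2_get_options(re_literal)
str(opts)
```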

Regexp Syntax

re2 supports Perl-style regular expressions (with extensions like \d, \w, \s, …) and provides most of the functionality of PCRE, eschewing only backreferences and look-around assertions.

See RE2 Syntax for the syntax supported by RE2, and a comparison with PCRE and Perl regexps.

For those not familiar with Perl’s regular expressions, here are some examples of the most commonly used extensions:


| Pattern | Meaning |
|---|---|
| `"hello (\\w+) world"` | `\w` matches a “word” character |
| `"version (\\d+)"` | `\d` matches a digit |
| `"hello\\s+world"` | `\s` matches any whitespace character |
| `"\\b(\\w+)\\b"` | `\b` matches the empty string at a word boundary |
| `"(?i)hello"` | `(?i)` turns on case-insensitive matching |
| `"/\\*(.*?)\\*/"` | `.*?` matches `.` the minimum number of times possible |

The double backslashes are needed when writing R string literals. However, they should not be used when writing raw string literals (available since R 4.0.0):


| Pattern | Meaning |
|---|---|
| `r"(hello (\w+) world)"` | `\w` matches a “word” character |
| `r"(version (\d+))"` | `\d` matches a digit |
| `r"(hello\s+world)"` | `\s` matches any whitespace character |
| `r"(\b(\w+)\b)"` | `\b` matches the empty string at a word boundary |
| `r"((?i)hello)"` | `(?i)` turns on case-insensitive matching |
| `r"(/\*(.*?)\*/)"` | `.*?` matches `.` the minimum number of times possible |
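
The two spellings denote the same pattern string. The short check below compares an escaped literal with its raw-string form and then passes the raw form to re2_detect(). The sample input "version 42" is made up for illustration, and raw strings require R 4.0.0 or later.

```r
library(re2)

escaped <- "version (\\d+)"    # ordinary literal: backslash doubled
raw     <- r"(version (\d+))"  # raw literal: single backslash

# Both literals produce exactly the same character string.
same <- identical(escaped, raw)
same
#> [1] TRUE

found <- re2_detect("version 42", raw)
found
#> [1] TRUE
```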

References
