Uniform interface to three regex engines

2025-04-08

Several C libraries providing regular expression engines are available in R. The standard R distribution has included the Perl-Compatible Regular Expressions (PCRE) C library since 2002. The CRAN package re2r provides the RE2 library, and stringi provides the ICU library. Each of these regex engines has a unique feature set, and may be preferred for different applications. For example, PCRE is installed by default, RE2 guarantees matching in polynomial time, and ICU provides strong Unicode support. For a more detailed comparison of the relative strengths of each regex library, we refer the reader to our previous research paper, Comparing namedCapture with other R packages for regular expressions.

Each regex engine has a different R interface, so switching from one engine to another may require non-trivial modifications of user code. To make switching between engines easier, the namedCapture package provides a uniform interface for capturing text using PCRE and RE2. The user may specify the desired engine via an option; the namedCapture package provides the output in a uniform format. However, namedCapture requires the engine to support specifying capture group names in regex pattern strings, and to support output of the group names to R (which ICU does not support).

Our proposed nc package provides support for the ICU engine in addition to PCRE and RE2. The nc package implements this functionality using un-named capture groups, which are supported in all three regex engines. In particular, a regular expression is constructed in R code that uses named arguments to indicate capturing sub-patterns, which are translated to un-named groups when passed to the regex engine. For example, consider a user who wants to capture the two pieces of the column names of the iris data, e.g., Sepal.Length. The user would typically specify the capturing regular expression as a string literal, e.g., "(.*)[.](.*)". Using nc the same pattern can be applied to the iris data column names via

nc::capture_first_vec(
  names(iris),
  part = ".*",
  "[.]",
  dim = ".*",
  engine = "ICU",
  nomatch.error = FALSE)
#>      part    dim
#>    <char> <char>
#> 1:  Sepal Length
#> 2:  Sepal  Width
#> 3:  Petal Length
#> 4:  Petal  Width
#> 5:   <NA>   <NA>

Above we see an example usage of nc::capture_first_vec, which is for capturing the first match of a regex from each element of a character vector subject (the first argument). There is a variable number of other arguments (...) which are used to define the regex pattern. In this case there are three pattern arguments: part = ".*", "[.]", dim = ".*". Each named R argument in the pattern generates an un-named capture group by enclosing the specified character string in parentheses, e.g., (.*) for both part and dim arguments above. All of the sub-patterns are pasted together in the sequence they appear in order to create the final pattern that is used with the specified regex engine. The nomatch.error = FALSE argument is given because the default is to stop with an error if any subjects do not match the specified pattern (the fifth subject, Species, does not match). Under the hood, the following function is called to parse the pattern arguments:

str(compiled <- nc::var_args_list(part = ".*", "[.]", dim = ".*"))
#> List of 2
#>  $ fun.list:List of 2
#>   ..$ part:function (x)
#>   ..$ dim :function (x)
#>  $ pattern : chr "(.*)[.](.*)"

This function is intended mostly for internal use, but can be useful for viewing the generated regex pattern (or using it as input to another regex function). The return value is a named list of two elements: pattern is the capturing regular expression which is generated based on the input arguments, and fun.list is a named list of type conversion functions. If the user does not specify a type conversion function for a group (as in the example code above), then the default is base::identity, which simply returns the captured character strings. Group-specific type conversion functions are useful for converting captured text into numeric output columns. Note that the order of elements in fun.list corresponds to the order of capture groups in the pattern (e.g., the first capture group is named part, the second dim). These data can be used with any regex engine that supports un-named capture groups (including ICU) in order to get a capture matrix with column names, e.g.,

m <- stringi::stri_match_first_regex(names(iris), compiled$pattern)
colnames(m) <- c("match", names(compiled$fun.list))
m
#>      match          part    dim
#> [1,] "Sepal.Length" "Sepal" "Length"
#> [2,] "Sepal.Width"  "Sepal" "Width"
#> [3,] "Petal.Length" "Petal" "Length"
#> [4,] "Petal.Width"  "Petal" "Width"
#> [5,] NA             NA      NA
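The translation from named arguments to a capture matrix can also be sketched in base R alone, using the PCRE engine. This is only an illustration under simplifying assumptions: the helper sketch_pattern is hypothetical and not part of nc, and it handles only character string arguments (no type conversion functions and no list arguments).

```r
# Hypothetical helper, NOT part of nc: wrap each named argument in
# parentheses, leave un-named arguments as-is, and paste in order.
sketch_pattern <- function(...) {
  args <- list(...)
  arg.names <- names(args)
  if (is.null(arg.names)) arg.names <- rep("", length(args))
  is.group <- arg.names != ""
  pieces <- ifelse(is.group, paste0("(", unlist(args), ")"), unlist(args))
  list(
    pattern = paste(pieces, collapse = ""),
    group.names = arg.names[is.group])
}

sketch <- sketch_pattern(part = ".*", "[.]", dim = ".*")

# Match with base R's PCRE engine; regmatches() returns character(0)
# for subjects that do not match, which we fill with NA.
subject <- names(iris)
match.list <- regmatches(
  subject, regexec(sketch$pattern, subject, perl = TRUE))
n.col <- 1 + length(sketch$group.names)
sketch.mat <- t(sapply(match.list, function(x) {
  if (length(x) == 0) rep(NA_character_, n.col) else x
}))
colnames(sketch.mat) <- c("match", sketch$group.names)
sketch.mat
```

The group names never reach the regex engine; they are kept on the R side and attached to the result afterwards, which is why the approach does not depend on engine support for named groups.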

Again, this is not the recommended usage of nc, but here we give these details in order to explain how it works. Note that the result from stringi is a character matrix with three columns: first for the entire match, and another column for each capture group. Using the same pattern with base::regexpr (PCRE engine) or re2r::re2_match (RE2 engine) yields output in varying formats. The nc package takes care of converting these different results into a standard data table format which makes it easy to switch regex engines (by changing the value of the engine argument). Most of the time the different engines give similar results, but in some cases there are differences:

u.subject <- "a\U0001F60E#"
u.pattern <- list(emoji = "\\p{EMOJI_Presentation}") # only supported in ICU.
old.opt <- options(nc.engine = "ICU")
nc::capture_first_vec(u.subject, u.pattern)
#>           emoji
#>          <char>
#> 1: <U+0001F60E>
nc::capture_first_vec(u.subject, u.pattern, engine = "PCRE")
#>           emoji
#>          <char>
#> 1: <U+0001F60E>
nc::capture_first_vec(u.subject, u.pattern, engine = "RE2")
#> re2google/re2/re2.cc:205: Error parsing '(?:(?:(\p{EMOJI_Presentation})))': invalid character class range: \p{EMOJI_Presentation}
#> Error in value[[3L]](cond): (?:(?:(\p{EMOJI_Presentation})))
#> when matching pattern above with RE2 engine, an error occured: invalid character class range: \p{EMOJI_Presentation}
options(old.opt)

Note that the standard output format used by nc, as shown above with nc::capture_first_vec, is a data table (not a character matrix, as in other regex packages). The main reason nc always outputs a data table is to support output columns of different types when type conversion functions are specified.
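To illustrate why a table with typed columns is needed, here is a small base R sketch that applies a named list of conversion functions, one per capture group, to a character match matrix. The subject strings, the value group, and the variable names are made up for illustration; the sketch uses a plain data.frame rather than a data table, assumes every subject matches, and is not the nc implementation.

```r
# Made-up subjects: two name parts followed by a numeric value.
subject <- c("Sepal.Length=5.1", "Petal.Width=0.2")
pattern <- "(.*)[.](.*)=(.*)"

# One conversion function per capture group, in group order
# (same structure as the fun.list element described earlier).
fun.list <- list(part = identity, dim = identity, value = as.numeric)

# Character matrix of captured groups (first column, the entire
# match, is dropped). Assumes every subject matches the pattern.
match.list <- regmatches(subject, regexec(pattern, subject, perl = TRUE))
m <- do.call(rbind, match.list)[, -1, drop = FALSE]
colnames(m) <- names(fun.list)

# Apply each conversion function to its column; a matrix could only
# hold one type, whereas a data.frame keeps a type per column.
converted <- Map(function(f, col) f(m[, col]), fun.list, names(fun.list))
result <- as.data.frame(converted, stringsAsFactors = FALSE)
str(result)
```

Here part and dim stay character while value becomes numeric, which a character matrix could not represent.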

