Tidy output from regular expression matches
r-lib/rematch2
Match Regular Expressions with a Nicer ‘API’
A small wrapper on the regular expression matching functions `regexpr` and `gregexpr` that returns the results in tidy data frames.
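For context, the sketch below shows what base R's `regexpr` returns on its own: match positions and capture groups arrive as attributes that have to be unpacked by hand. It uses base R only; the input vector and the pattern are made up for illustration and are not taken from the package.

```r
# Base R only: no rematch2 needed for this sketch.
x <- c("2016-04-20", "not a date")
m <- regexpr("(?<year>[0-9]{4})-(?<month>[0-1][0-9])", x, perl = TRUE)

# -1 marks "no match"; as.vector() drops the attributes regexpr attaches.
matched <- as.vector(m) != -1L

# With perl = TRUE, capture group positions are stored as attributes.
cap_start <- attr(m, "capture.start")
cap_len   <- attr(m, "capture.length")

# Cut the "year" group out of each string, then blank out the non-matches.
year <- substring(x, cap_start[, "year"],
                  cap_start[, "year"] + cap_len[, "year"] - 1L)
year[!matched] <- NA_character_
year
```

This is the kind of bookkeeping that a data frame with one column per capture group makes unnecessary.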
Stable version:

    install.packages("rematch2")

Development version:

    pak::pak("r-lib/rematch2")

Note that `rematch2` is not compatible with the original `rematch` package. There are at least three major changes:
- The order of the arguments for the functions is different. In `rematch2` the `text` vector is first, and `pattern` is second.
- In the result, `.match` is the last column instead of the first.
- `rematch2` returns `tibble` data frames. See https://github.com/tidyverse/tibble.
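
The three differences listed above can be checked directly. This is a minimal sketch, assuming `rematch2` is installed; the input strings and the group names `letter` and `digit` are hypothetical and chosen only for illustration.

```r
library(rematch2)

# Illustrative input; the group names are made up for this sketch.
# text is the first argument and pattern the second.
res <- re_match(c("a1", "b2", "zz"), "(?<letter>[a-z])(?<digit>[0-9])")

# Capture groups come first, then .text, with .match as the last column.
names(res)

# The result should inherit from the tibble class.
inherits(res, "tbl_df")
```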

    library(rematch2)

With capture groups:

    dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02",
               "2012-06-30", "2015-01-21 19:58")
    isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
    re_match(text = dates, pattern = isodate)
    #> # A tibble: 7 × 5
    #>   ``    ``    ``    .text            .match
    #>   <chr> <chr> <chr> <chr>            <chr>
    #> 1 2016  04    20    2016-04-20       2016-04-20
    #> 2 1977  08    08    1977-08-08       1977-08-08
    #> 3 <NA>  <NA>  <NA>  not a date       <NA>
    #> 4 <NA>  <NA>  <NA>  2016             <NA>
    #> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
    #> 6 2012  06    30    2012-06-30       2012-06-30
    #> 7 2015  01    21    2015-01-21 19:58 2015-01-21

Named capture groups:

    isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
    re_match(text = dates, pattern = isodaten)
    #> # A tibble: 7 × 5
    #>   year  month day   .text            .match
    #>   <chr> <chr> <chr> <chr>            <chr>
    #> 1 2016  04    20    2016-04-20       2016-04-20
    #> 2 1977  08    08    1977-08-08       1977-08-08
    #> 3 <NA>  <NA>  <NA>  not a date       <NA>
    #> 4 <NA>  <NA>  <NA>  2016             <NA>
    #> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
    #> 6 2012  06    30    2012-06-30       2012-06-30
    #> 7 2015  01    21    2015-01-21 19:58 2015-01-21

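
Because the named groups come back as ordinary character columns, follow-up processing needs only base R. The sketch below is one possible way to drop the non-matching rows and convert the groups to integers; the names `matched` and `parsed` are illustrative and not part of the package.

```r
library(rematch2)

dates <- c("2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02",
           "2012-06-30", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"

m <- re_match(text = dates, pattern = isodaten)

# Rows without a match have NA in .match (and in every group column).
matched <- !is.na(m$.match)

# Keep the matched rows only and convert the character groups to integers.
parsed <- data.frame(
  year  = as.integer(m$year[matched]),
  month = as.integer(m$month[matched]),
  day   = as.integer(m$day[matched])
)
parsed
```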
A slightly more complex example:

    github_repos <- c(
      "metacran/crandb",
      "jeroenooms/curl@v0.9.3",
      "jimhester/covr#47",
      "hadley/dplyr@*release",
      "r-lib/remotes@550a3c7d3f9e1493a2ba",
      "/$&@R64&3"
    )
    owner_rx   <- "(?:(?<owner>[^/]+)/)?"
    repo_rx    <- "(?<repo>[^/@#]+)"
    subdir_rx  <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
    ref_rx     <- "(?:@(?<ref>[^*].*))"
    pull_rx    <- "(?:#(?<pull>[0-9]+))"
    release_rx <- "(?:@(?<release>[*]release))"
    subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
    github_rx  <- sprintf(
      "^(?:%s%s%s%s|(?<catchall>.*))$",
      owner_rx, repo_rx, subdir_rx, subtype_rx
    )
    re_match(text = github_repos, pattern = github_rx)
    #> # A tibble: 6 × 9
    #>   owner        repo      subdir ref          pull  release catchall .text .match
    #>   <chr>        <chr>     <chr>  <chr>        <chr> <chr>   <chr>    <chr> <chr>
    #> 1 "metacran"   "crandb"  ""     ""           ""    ""      ""       meta… metac…
    #> 2 "jeroenooms" "curl"    ""     "v0.9.3"     ""    ""      ""       jero… jeroe…
    #> 3 "jimhester"  "covr"    ""     ""           "47"  ""      ""       jimh… jimhe…
    #> 4 "hadley"     "dplyr"   ""     ""           ""    "*rele… ""       hadl… hadle…
    #> 5 "r-lib"      "remotes" ""     "550a3c7d3f… ""    ""      ""       r-li… r-lib…
    #> 6 ""           ""        ""     ""           ""    ""      "/$&@R6… /$&@… /$&@R…

Extract all names, and also first names and last names:

    name_rex <- paste0(
      "(?<first>[[:upper:]][[:lower:]]+) ",
      "(?<last>[[:upper:]][[:lower:]]+)"
    )
    notables <- c(
      "  Ben Franklin and Jefferson Davis",
      "\tMillard Fillmore"
    )
    not <- re_match_all(notables, name_rex)
    not
    #> # A tibble: 2 × 4
    #>   first     last      .text                                .match
    #>   <list>    <list>    <chr>                                <list>
    #> 1 <chr [2]> <chr [2]> "  Ben Franklin and Jefferson Davis" <chr [2]>
    #> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"                 <chr [1]>

    not$first
    #> [[1]]
    #> [1] "Ben"       "Jefferson"
    #>
    #> [[2]]
    #> [1] "Millard"
    not$last
    #> [[1]]
    #> [1] "Franklin" "Davis"
    #>
    #> [[2]]
    #> [1] "Fillmore"
    not$.match
    #> [[1]]
    #> [1] "Ben Franklin"    "Jefferson Davis"
    #>
    #> [[2]]
    #> [1] "Millard Fillmore"

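
Since `re_match_all` returns list columns, the usual base R list helpers apply to them. A minimal sketch, assuming `rematch2` is installed; `n_matches` and `all_last` are illustrative names, not package API.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(not$.match)

# All last names across all inputs, flattened into one character vector.
all_last <- unlist(not$last)

n_matches
all_last
```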
`re_exec` and `re_exec_all` are similar to `re_match` and `re_match_all`, but they also return match positions. These functions return match records. A match record has three components: `match`, `start`, `end`, and each component can be a vector. It is similar to a data frame in this respect.

    pos <- re_exec(notables, name_rex)
    pos
    #> # A tibble: 2 × 4
    #>   first            last             .text                           .match
    #>   <rmtch_rc>       <rmtch_rc>       <chr>                           <rmtch_rc>
    #> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
    #> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but `rematch2` defines some special classes and an `$` operator, to make it easier to extract parts of `re_exec` and `re_exec_all` matches. You simply query the `match`, `start` or `end` part of a column:

    pos$first$match
    #> [1] "Ben"     "Millard"
    pos$first$start
    #> [1] 3 2
    pos$first$end
    #> [1] 5 8

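
The `start` and `end` values are 1-based character positions in the original strings, so they can be passed straight to base R's `substring`. The sketch below, assuming `rematch2` is installed, cuts the first names out of the input by position and compares them with the `match` component; `recovered` is an illustrative name.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Extract the first name of each input using the reported positions.
recovered <- substring(notables, pos$first$start, pos$first$end)

# The position-based extraction should agree with the match component.
all(recovered == pos$first$match)
```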
`re_exec_all` is very similar, but these queries return lists, with an arbitrary number of matches:

    allpos <- re_exec_all(notables, name_rex)
    allpos
    #> # A tibble: 2 × 4
    #>   first            last             .text                           .match
    #>   <rmtch_ll>       <rmtch_ll>       <chr>                           <rmtch_ll>
    #> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
    #> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

    allpos$first$match
    #> [[1]]
    #> [1] "Ben"       "Jefferson"
    #>
    #> [[2]]
    #> [1] "Millard"
    allpos$first$start
    #> [[1]]
    #> [1]  3 20
    #>
    #> [[2]]
    #> [1] 2
    allpos$first$end
    #> [[1]]
    #> [1]  5 28
    #>
    #> [[2]]
    #> [1] 8

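
If one row per match is more convenient than list columns, the lists can be flattened with base R. This is one possible approach rather than a package feature; it assumes `rematch2` is installed, and the names `text_id` and `long` are illustrative.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# How many matches each input produced.
n <- lengths(allpos$first$match)

# One row per match, with an index pointing back to the input string.
long <- data.frame(
  text_id = rep(seq_along(notables), n),
  first   = unlist(allpos$first$match),
  start   = unlist(allpos$first$start),
  end     = unlist(allpos$first$end)
)
long
```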
MIT © Mango Solutions, Gábor Csárdi