rematch2

Match Regular Expressions with a Nicer ‘API’

A small wrapper on regular expression matching functions regexpr and gregexpr to return the results in tidy data frames.
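For context, here is a minimal base R sketch (no rematch2 involved) of what a raw regexpr call returns and how much manual work it takes to pull out a capture group. The variable names and the shortened pattern are illustrative assumptions, not taken from the package.

```r
# Base R only: regexpr() with perl = TRUE returns start positions plus
# attributes describing match lengths and capture groups.
text <- c("2016-04-20", "not a date")
pattern <- "(?<year>[0-9]{4})-(?<month>[0-9]{2})"

m <- regexpr(pattern, text, perl = TRUE)

# Start position of each match; -1 means "no match".
start <- as.integer(m)
matched <- start != -1L

# Capture group positions are stored as matrices, one column per group.
cap_start <- attr(m, "capture.start")
cap_len <- attr(m, "capture.length")

# Extracting the "year" group by hand needs substring() arithmetic.
year <- rep(NA_character_, length(text))
year[matched] <- substring(
  text[matched],
  cap_start[matched, "year"],
  cap_start[matched, "year"] + cap_len[matched, "year"] - 1L
)
year
```

This bookkeeping is the kind of work a tidy wrapper is meant to hide.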

Installation

Stable version:

install.packages("rematch2")

Development version:

pak::pak("r-lib/rematch2")

Rematch vs rematch2

Note that rematch2 is not compatible with the original rematch package. There are at least three major changes:

The order of the arguments for the functions is different. In rematch2 the text vector is first, and pattern is second.
In the result, .match is the last column instead of the first.
rematch2 returns tibble data frames. See https://github.com/tidyverse/tibble.

Usage

First match

library(rematch2)

With capture groups:

dates <- c(
  "2016-04-20", "1977-08-08", "not a date", "2016", "76-03-02",
  "2012-06-30", "2015-01-21 19:58"
)
isodate <- "([0-9]{4})-([0-1][0-9])-([0-3][0-9])"
re_match(text = dates, pattern = isodate)
#> # A tibble: 7 × 5
#>   ``    ``    ``    .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21

Named capture groups:

isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
re_match(text = dates, pattern = isodaten)
#> # A tibble: 7 × 5
#>   year  month day   .text            .match
#>   <chr> <chr> <chr> <chr>            <chr>
#> 1 2016  04    20    2016-04-20       2016-04-20
#> 2 1977  08    08    1977-08-08       1977-08-08
#> 3 <NA>  <NA>  <NA>  not a date       <NA>
#> 4 <NA>  <NA>  <NA>  2016             <NA>
#> 5 <NA>  <NA>  <NA>  76-03-02         <NA>
#> 6 2012  06    30    2012-06-30       2012-06-30
#> 7 2015  01    21    2015-01-21 19:58 2015-01-21
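Because the result is an ordinary tibble, the usual data frame tools apply to it. The following is a small sketch, assuming rematch2 is installed, that drops the non-matching rows and converts the captured strings; the names matched and parsed are illustrative.

```r
library(rematch2)

dates <- c("2016-04-20", "not a date", "2015-01-21 19:58")
isodaten <- "(?<year>[0-9]{4})-(?<month>[0-1][0-9])-(?<day>[0-3][0-9])"
m <- re_match(text = dates, pattern = isodaten)

# Rows without a match have NA in .match, so they can be filtered out.
matched <- m[!is.na(m$.match), ]

# Capture groups are returned as character columns; convert as needed.
matched$year <- as.integer(matched$year)
parsed <- as.Date(matched$.match)
```

Filtering on .match rather than on a capture column keeps the check independent of which groups the pattern defines.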

A slightly more complex example:

github_repos <- c(
  "metacran/crandb",
  "jeroenooms/curl@v0.9.3",
  "jimhester/covr#47",
  "hadley/dplyr@*release",
  "r-lib/remotes@550a3c7d3f9e1493a2ba",
  "/$&@R64&3"
)
owner_rx <- "(?:(?<owner>[^/]+)/)?"
repo_rx <- "(?<repo>[^/@#]+)"
subdir_rx <- "(?:/(?<subdir>[^@#]*[^@#/]))?"
ref_rx <- "(?:@(?<ref>[^*].*))"
pull_rx <- "(?:#(?<pull>[0-9]+))"
release_rx <- "(?:@(?<release>[*]release))"
subtype_rx <- sprintf("(?:%s|%s|%s)?", ref_rx, pull_rx, release_rx)
github_rx <- sprintf(
  "^(?:%s%s%s%s|(?<catchall>.*))$",
  owner_rx, repo_rx, subdir_rx, subtype_rx
)
re_match(text = github_repos, pattern = github_rx)
#> # A tibble: 6 × 9
#>   owner        repo      subdir ref          pull  release catchall .text .match
#>   <chr>        <chr>     <chr>  <chr>        <chr> <chr>   <chr>    <chr> <chr>
#> 1 "metacran"   "crandb"  ""     ""           ""    ""      ""       meta… metac…
#> 2 "jeroenooms" "curl"    ""     "v0.9.3"     ""    ""      ""       jero… jeroe…
#> 3 "jimhester"  "covr"    ""     ""           "47"  ""      ""       jimh… jimhe…
#> 4 "hadley"     "dplyr"   ""     ""           ""    "*rele… ""       hadl… hadle…
#> 5 "r-lib"      "remotes" ""     "550a3c7d3f… ""    ""      ""       r-li… r-lib…
#> 6 ""           ""        ""     ""           ""    ""      "/$&@R6… /$&@… /$&@R…

All matches

Extract all names, and also first names and last names:

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
not <- re_match_all(notables, name_rex)
not
#> # A tibble: 2 × 4
#>   first     last      .text                                .match
#>   <list>    <list>    <chr>                                <list>
#> 1 <chr [2]> <chr [2]> "  Ben Franklin and Jefferson Davis" <chr [2]>
#> 2 <chr [1]> <chr [1]> "\tMillard Fillmore"                 <chr [1]>

not$first
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"

not$last
#> [[1]]
#> [1] "Franklin" "Davis"
#>
#> [[2]]
#> [1] "Fillmore"

not$.match
#> [[1]]
#> [1] "Ben Franklin"    "Jefferson Davis"
#>
#> [[2]]
#> [1] "Millard Fillmore"
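The list columns can be flattened into one row per match with base R. This is a sketch under the assumption that rematch2 is installed and that the pattern separates first and last name with a single space; all_names, n_matches and flat are illustrative names.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
all_names <- re_match_all(notables, name_rex)

# Number of matches found in each input string.
n_matches <- lengths(all_names$.match)

# One row per match, with an index pointing back to the input string.
flat <- data.frame(
  text_id = rep(seq_along(notables), n_matches),
  first = unlist(all_names$first),
  last = unlist(all_names$last),
  stringsAsFactors = FALSE
)
flat
```

Keeping text_id makes it possible to join the flattened rows back to the original inputs.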

Match positions

re_exec and re_exec_all are similar to re_match and re_match_all, but they also return match positions. These functions return match records. A match record has three components: match, start, end, and each component can be a vector. It is similar to a data frame in this respect.

pos <- re_exec(notables, name_rex)
pos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match
#>   <rmtch_rc>       <rmtch_rc>       <chr>                           <rmtch_rc>
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

Unfortunately R does not allow hierarchical data frames (i.e. a column of a data frame cannot be another data frame), but rematch2 defines some special classes and an $ operator, to make it easier to extract parts of re_exec and re_exec_all matches. You simply query the match, start or end part of a column:

pos$first$match
#> [1] "Ben"     "Millard"

pos$first$start
#> [1] 3 2

pos$first$end
#> [1] 5 8
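The start and end values are 1-based, inclusive character positions, so they can be passed straight to substring. A short sketch, assuming rematch2 is installed and the same inputs as above; recovered is an illustrative name.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
pos <- re_exec(notables, name_rex)

# Cutting the input at the reported positions gives back the matched text.
recovered <- substring(notables, pos$first$start, pos$first$end)
recovered
```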

re_exec_all is very similar, but these queries return lists, with an arbitrary number of matches:

allpos <- re_exec_all(notables, name_rex)
allpos
#> # A tibble: 2 × 4
#>   first            last             .text                           .match
#>   <rmtch_ll>       <rmtch_ll>       <chr>                           <rmtch_ll>
#> 1 <named list [3]> <named list [3]> "  Ben Franklin and Jefferson … <named list>
#> 2 <named list [3]> <named list [3]> "\tMillard Fillmore"            <named list>

allpos$first$match
#> [[1]]
#> [1] "Ben"       "Jefferson"
#>
#> [[2]]
#> [1] "Millard"

allpos$first$start
#> [[1]]
#> [1]  3 20
#>
#> [[2]]
#> [1] 2

allpos$first$end
#> [[1]]
#> [1]  5 28
#>
#> [[2]]
#> [1] 8
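Since each query returns one list element per input string, the positions can be applied string by string. The sketch below assumes rematch2 is installed and uses base R's mapply; recovered is an illustrative name.

```r
library(rematch2)

name_rex <- paste0(
  "(?<first>[[:upper:]][[:lower:]]+) ",
  "(?<last>[[:upper:]][[:lower:]]+)"
)
notables <- c(
  "  Ben Franklin and Jefferson Davis",
  "\tMillard Fillmore"
)
allpos <- re_exec_all(notables, name_rex)

# One substring() call per input string; each call is vectorized over
# that string's start and end positions.
recovered <- mapply(
  function(txt, s, e) substring(txt, s, e),
  notables, allpos$first$start, allpos$first$end,
  SIMPLIFY = FALSE, USE.NAMES = FALSE
)
recovered
```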

License

MIT © Mango Solutions, Gábor Csárdi
