
nc is a package for named capture regular expressions (regex), which are useful for parsing/converting text data to tabular data (one row per match, one column per capture group). In the terminology of regex, we attempt to match a regex/pattern to a subject, which is a string of text data. The regex/pattern is typically defined using a single string (in other frameworks/packages/languages), but in nc we use a special syntax: one or more R arguments are concatenated to define a regex/pattern, and named arguments are used as capture groups. For more info about regex in general, see regular-expressions.info and/or the Friedl book. For more info about the special nc syntax, see help("nc", package="nc").
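For comparison, the single-string style used in other frameworks can be sketched in base R. This is only an illustration of the contrast described above, not how nc is implemented; the pattern string and variable names are assumptions chosen to mirror the chrom/chromStart example used later in this document.

```r
# A single subject and a single-string regex with two unnamed capture groups.
subject <- "chr10:213054000-213,055,000"
pattern <- "(chr.*?):([0-9,]+)"

# regexec() finds the first match and the text of each capture group;
# regmatches() extracts them as a character vector.
first.match <- regmatches(subject, regexec(pattern, subject, perl = TRUE))[[1]]

# Element 1 is the whole match; groups follow in order, and we must
# remember which position corresponds to which field.
chrom <- first.match[2]
chromStart <- first.match[3]
print(c(chrom = chrom, chromStart = chromStart))
```

With the nc syntax, the group names (chrom, chromStart) are attached directly to the sub-patterns as argument names, so there is no need to track group positions by hand.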

Below is an index of topics which are explained in the different vignettes, along with an overview of functionality using simple examples.

Capture first match in several subjects

Capture first is for the situation when your input is a character vector (each element is a different subject to parse), you want to find the first match of a regex to each subject, and your desired output is a data table (one row per subject, one column per capture group in the regex).

subject.vec <- c(
  "chr10:213054000-213,055,000",
  "chrM:111000",
  "chr1:110-111 chr2:220-222")
nc::capture_first_vec(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
#>     chrom chromStart
#>    <char>      <int>
#> 1:  chr10  213054000
#> 2:   chrM     111000
#> 3:   chr1        110

A variant does the same thing, but with input subjects coming from a data table/frame with character columns.

library(data.table)
subject.dt <- data.table(
  JobID = c("13937810_25", "14022192_1"),
  Elapsed = c("07:04:42", "07:04:49"))
int.pat <- list("[0-9]+", as.integer)
nc::capture_first_df(
  subject.dt,
  JobID=list(job=int.pat, "_", task=int.pat),
  Elapsed=list(hours=int.pat, ":", minutes=int.pat, ":", seconds=int.pat))
#>          JobID  Elapsed      job  task hours minutes seconds
#>         <char>   <char>    <int> <int> <int>   <int>   <int>
#> 1: 13937810_25 07:04:42 13937810    25     7       4      42
#> 2:  14022192_1 07:04:49 14022192     1     7       4      49

Capture all matches in a single subject

Capture all is for the situation when your input is a single character string or text file subject, you want to find all matches of a regex to that subject, and your desired output is a data table (one row per match, one column per capture group in the regex).

nc::capture_all_str(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
#>     chrom chromStart
#>    <char>      <int>
#> 1:  chr10  213054000
#> 2:   chrM     111000
#> 3:   chr1        110
#> 4:   chr2        220

Reshape a data table with regularly named columns

Capture melt is for the situation when your input is a data table/frame that has regularly named columns, and your desired output is a data table with those columns reshaped into a taller/longer form. In that case you can use a regex to identify the columns to reshape.

(one.iris <- data.frame(iris[1,]))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single  (one.iris, part  =".*", "[.]", dim   =".*")
#>    Species   part    dim value
#>     <fctr> <char> <char> <num>
#> 1:  setosa  Sepal Length   5.1
#> 2:  setosa  Sepal  Width   3.5
#> 3:  setosa  Petal Length   1.4
#> 4:  setosa  Petal  Width   0.2
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim   =".*")
#>    Species    dim Petal Sepal
#>     <fctr> <char> <num> <num>
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5
nc::capture_melt_multiple(one.iris, part  =".*", "[.]", column=".*")
#>    Species   part Length Width
#>     <fctr> <char>  <num> <num>
#> 1:  setosa  Petal    1.4   0.2
#> 2:  setosa  Sepal    5.1   3.5

Reading regularly named data files

Capture glob is for the situation when you have several data files on disk, with regular names that you can match with a glob/regex. In the example below we first write one CSV file for each iris Species,

dir.create(iris.dir <- tempfile())
icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
data.table(iris)[, fwrite(.SD, icsv(Species)), by=Species]
#> Empty data.table (0 rows and 1 cols): Species
dir(iris.dir)
#> [1] "setosa.csv"     "versicolor.csv" "virginica.csv"

We then use a glob and a regex to read those files in the code below:

nc::capture_first_glob(
  file.path(iris.dir, "*.csv"),
  Species="[^/]+", "[.]csv")
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>         <char>        <num>       <num>        <num>       <num>
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8
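To make the glob-then-parse workflow in this section more concrete, here is a rough manual approximation using only base R. It is a sketch under stated assumptions, not a description of how nc implements this internally: the helper name read_one and the directory variable manual.dir are hypothetical, and the example writes its own temporary CSV files so that it is self-contained.

```r
# Write one CSV file per iris Species into a fresh temporary directory
# (manual.dir is a hypothetical name used only in this sketch).
manual.dir <- tempfile()
dir.create(manual.dir)
for (sp in levels(iris$Species)) {
  write.csv(
    iris[iris$Species == sp, 1:4],
    file.path(manual.dir, paste0(sp, ".csv")),
    row.names = FALSE)
}

# Step 1: expand the glob into a vector of file paths.
csv.files <- Sys.glob(file.path(manual.dir, "*.csv"))

# Step 2: for each file, recover Species from the file name with a
# regex, read the data, and attach Species as a column.
read_one <- function(csv.path) {
  species <- sub("[.]csv$", "", basename(csv.path))
  data.frame(Species = species, read.csv(csv.path))
}

# Step 3: stack the per-file tables into a single data frame.
combined <- do.call(rbind, lapply(csv.files, read_one))
str(combined)
```

The manual version needs three separate steps and a helper function, which is the bookkeeping that a single glob-plus-regex call is meant to replace.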

Helper functions for defining complex patterns

Helpers describes various functions that simplify the definition of complex regex patterns. For example, nc::field helps avoid repetition below,

subject.vec <- c("sex_child1", "age_child1", "sex_child2")
pattern <- list(
  variable="age|sex", "_",
  nc::field("child", "", "[12]", as.integer))
nc::capture_first_vec(subject.vec, pattern)
#>    variable child
#>      <char> <int>
#> 1:      sex     1
#> 2:      age     1
#> 3:      sex     2

It also explains how to define common sub-patterns which are used in several different alternatives.

subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day="[0-9]{2}",
  year="[0-9]{4}",
  list(month, " ", day, ", ", year),
  list(day, " ", month, " ", year))
nc::capture_first_vec(subject.vec, pattern)
#>     month    day   year
#>    <char> <char> <char>
#> 1:    mar     17   1983
#> 2:    sep     26   2017
#> 3:    mar     17   1984