nc is a package for named capture regular expressions (regex), which are useful for parsing/converting text data to tabular data (one row per match, one column per capture group). In the terminology of regex, we attempt to match a regex/pattern to a subject, which is a string of text data. The regex/pattern is typically defined using a single string (in other frameworks/packages/languages), but in nc we use a special syntax: one or more R arguments are concatenated to define a regex/pattern, and named arguments are used as capture groups. For more info about regex in general see regular-expressions.info and/or the Friedl book. For more info about the special nc syntax, see help("nc",package="nc").
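To make the difference between the two styles concrete, the sketch below matches the same subject twice: first with a single-string regex using base R, then with nc arguments. The single-string pattern is our own illustration of roughly what the nc arguments correspond to; it is an assumption, not necessarily the exact string that nc builds internally.

```r
subject <- "chr10:213054000-213,055,000"

## Single-string style (base R): capture groups are unnamed parentheses.
## This pattern string is a hand-written illustration, not nc output.
single.string.pattern <- "(chr.*?):([0-9,]+)"
match.list <- regmatches(
  subject, regexec(single.string.pattern, subject, perl=TRUE))
match.list[[1]] # whole match, then one element per capture group

## nc style: unnamed arguments are concatenated as-is,
## named arguments become capture groups (and output columns).
nc.dt <- nc::capture_first_vec(
  subject, chrom="chr.*?", ":", chromStart="[0-9,]+")
nc.dt
```

In the nc version each capture group gets its name at the place where it is defined, so the column names of the result are visible in the pattern itself.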
Below is an index of topics which are explained in the different vignettes, along with an overview of functionality using simple examples.
Capture first is for the situation when your input is a character vector (each element is a different subject to parse), you want to find the first match of a regex to each subject, and your desired output is a data table (one row per subject, one column per capture group in the regex).
subject.vec <- c(
  "chr10:213054000-213,055,000",
  "chrM:111000",
  "chr1:110-111 chr2:220-222")
nc::capture_first_vec(
  subject.vec, chrom="chr.*?", ":", chromStart="[0-9,]+", as.integer)
#>     chrom chromStart
#>    <char>      <int>
#> 1:  chr10  213054000
#> 2:   chrM     111000
#> 3:   chr1        110

A variant is doing the same thing, but with input subjects coming from a data table/frame with character columns.
library(data.table)
subject.dt <- data.table(
  JobID = c("13937810_25", "14022192_1"),
  Elapsed = c("07:04:42", "07:04:49"))
int.pat <- list("[0-9]+", as.integer)
nc::capture_first_df(
  subject.dt,
  JobID=list(job=int.pat, "_", task=int.pat),
  Elapsed=list(hours=int.pat, ":", minutes=int.pat, ":", seconds=int.pat))
#>          JobID  Elapsed      job  task hours minutes seconds
#>         <char>   <char>    <int> <int> <int>   <int>   <int>
#> 1: 13937810_25 07:04:42 13937810    25     7       4      42
#> 2:  14022192_1 07:04:49 14022192     1     7       4      49

Capture all is for the situation when your input is a single character string or text file subject, you want to find all matches of a regex to that subject, and your desired output is a data table (one row per match, one column per capture group in the regex).
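As a minimal sketch of the capture all situation, the code below uses nc::capture_all_str on a single subject string that contains two matches. The subject and the variable names (all.subject, all.dt) are made up for this illustration.

```r
## One subject string containing two matches (illustrative data).
all.subject <- "chr1:110-111 chr2:220-222"
all.dt <- nc::capture_all_str(
  all.subject,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+", as.integer)
all.dt # one row per match: chr1/110 and chr2/220
```

Compare with the capture first example, where the same kind of subject yields only its first match (chr1, 110).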
Capture melt is for the situation when your input is a data table/frame that has regularly named columns, and your desired output is a data table with those columns reshaped into a taller/longer form. In that case you can use a regex to identify the columns to reshape.
(one.iris <- data.frame(iris[1,]))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
#>    Species   part    dim value
#>     <fctr> <char> <char> <num>
#> 1:  setosa  Sepal Length   5.1
#> 2:  setosa  Sepal  Width   3.5
#> 3:  setosa  Petal Length   1.4
#> 4:  setosa  Petal  Width   0.2
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
#>    Species    dim Petal Sepal
#>     <fctr> <char> <num> <num>
#> 1:  setosa Length   1.4   5.1
#> 2:  setosa  Width   0.2   3.5
nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
#>    Species   part Length Width
#>     <fctr> <char>  <num> <num>
#> 1:  setosa  Petal    1.4   0.2
#> 2:  setosa  Sepal    5.1   3.5

Capture glob is for the situation when you have several data files on disk, with regular names that you can match with a glob/regex. In the example below we first write one CSV file for each iris Species,
dir.create(iris.dir <- tempfile())
icsv <- function(sp) file.path(iris.dir, paste0(sp, ".csv"))
data.table(iris)[, fwrite(.SD, icsv(Species)), by=Species]
#> Empty data.table (0 rows and 1 cols): Species
dir(iris.dir)
#> [1] "setosa.csv"     "versicolor.csv" "virginica.csv"

We then use a glob and a regex to read those files in the code below:
nc::capture_first_glob(file.path(iris.dir, "*.csv"), Species="[^/]+", "[.]csv")
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>         <char>        <num>       <num>        <num>       <num>
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8

Helpers describes various functions that simplify the definition of complex regex patterns. For example nc::field helps avoid repetition below,
subject.vec <- c("sex_child1", "age_child1", "sex_child2")
pattern <- list(
  variable="age|sex", "_",
  nc::field("child", "", "[12]", as.integer))
nc::capture_first_vec(subject.vec, pattern)
#>    variable child
#>      <char> <int>
#> 1:      sex     1
#> 2:      age     1
#> 3:      sex     2

It also explains how to define common sub-patterns which are used in several different alternatives.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day="[0-9]{2}",
  year="[0-9]{4}",
  list(month, " ", day, ", ", year),
  list(day, " ", month, " ", year))
nc::capture_first_vec(subject.vec, pattern)
#>     month    day   year
#>    <char> <char> <char>
#> 1:    mar     17   1983
#> 2:    sep     26   2017
#> 3:    mar     17   1984