This R package contains tools to construct parser combinator functions, higher order functions that parse input. The main goal of this package is to simplify the creation of transparent parsers for structured text files generated by machines like laboratory instruments. Such files consist of lines of text organized in higher-order structures like headers with metadata and blocks of measured values. To read these data into R you first need to create a parser that processes these files and creates R-objects as output. The parcr package simplifies the task of creating such parsers.
This package was inspired by the package “Ramble” by Chapman Siu and co-workers and by the paper “Higher-order functions for parsing” by Graham Hutton (1992).
Install the stable version from CRAN:

    install.packages("parcr")

To install the development version including its vignette, run the following command:

    # install_github() comes from the remotes package
    remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)

As an example of a realistic application we write a parser for fasta-formatted files for nucleotide and protein sequences. We use a few simplifying assumptions about this format for the sake of the example. Real fasta files are more complex than we pretend here.
Please note that more background about the functions that we use here is available in the package documentation. Here we only present a summary.
A fasta file with mixed sequence types could look like the example below:

    >sequence_A
    GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA
    >sequence_B
    ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC
    >sequence_C
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE

Since fasta files are text files we could read such a file using readLines() into a character vector. The package provides the data set fastafile which contains that character vector.
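
    data("fastafile")

We can distinguish the following higher order components in a fasta file:

- A fasta file consists of one or more sequence blocks until the end of the file.
- A sequence block consists of a header followed by a nucleotide sequence or a protein sequence, and may be preceded by empty lines.
- A nucleotide sequence consists of one or more nucleotide sequence strings.
- A protein sequence consists of one or more protein sequence strings.
- A header is a string that starts with ">" immediately followed by a title without spaces.
- A nucleotide sequence string is a string without spaces that consists entirely of symbols from the set {G,A,T,C}.
- A protein sequence string is a string without spaces that consists entirely of symbols from the set {A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}.

It now becomes clear what we mean when we say that the package allows us to write transparent parsers: the description above of the structure of fasta files can be put straight into code for a Fasta() parser:

    Fasta <- function() {
      one_or_more(SequenceBlock()) %then%
        eof()
    }

    SequenceBlock <- function() {
      MaybeEmpty() %then%
        Header() %then%
        (NuclSequence() %or% ProtSequence()) %using%
        function(x) list(x)
    }

    NuclSequence <- function() {
      one_or_more(NuclSequenceString()) %using%
        function(x) list(type = "Nucl", sequence = paste(x, collapse = ""))
    }

    ProtSequence <- function() {
      one_or_more(ProtSequenceString()) %using%
        function(x) list(type = "Prot", sequence = paste(x, collapse = ""))
    }

Functions like one_or_more(), %then%, %or%, %using%, eof() and MaybeEmpty() are defined in the package and are the basic parsers with which the package user can build complex parsers. The %using% operator applies the function on its right-hand side to the output of the parser on its left-hand side. Please see the vignette in the parcr package for more explanation of why this is useful or even necessary.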
Notice that the new parser functions that we define above are higher order functions taking no input, hence the empty argument brackets () behind their names.
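
To make the idea of sequencing, post-processing and zero-argument constructors more concrete, here is a rough base R sketch. It is not the package's implementation: all names in it (line_equals, then_sketch, using_sketch, Greeting) are hypothetical, and a failed parse is represented by NULL purely for simplicity.

```r
# Simplified stand-ins for parser combinators (hypothetical names, base R only).
# A parser here is a function that takes a character vector and returns either
# NULL (failure, an assumption of this sketch) or
# list(L = <parsed output>, R = <remaining input>).

# Constructor: returns a parser that accepts one line equal to `s`
line_equals <- function(s) {
  function(x) {
    if (length(x) > 0 && x[1] == s) list(L = list(x[1]), R = x[-1]) else NULL
  }
}

# Sequencing: run p1, then run p2 on the input that p1 left over
then_sketch <- function(p1, p2) {
  function(x) {
    r1 <- p1(x)
    if (is.null(r1)) return(NULL)
    r2 <- p2(r1$R)
    if (is.null(r2)) return(NULL)
    list(L = c(r1$L, r2$L), R = r2$R)
  }
}

# Post-processing: apply f to the output of a successful parse
using_sketch <- function(p, f) {
  function(x) {
    r <- p(x)
    if (is.null(r)) return(NULL)
    list(L = f(r$L), R = r$R)
  }
}

# A zero-argument constructor, similar in shape to Fasta() above:
# calling Greeting() builds the parser, calling the result parses input.
Greeting <- function() {
  using_sketch(
    then_sketch(line_equals("hello"), line_equals("world")),
    function(x) list(text = paste(unlist(x), collapse = " "))
  )
}

result <- Greeting()(c("hello", "world", "rest"))
str(result)
```

The two pairs of brackets in `Greeting()(...)` mirror `Fasta()(fastafile)` further on: the first call constructs the parser, the second applies it to the input.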
Now we need to define the parsers Header(), NuclSequenceString() and ProtSequenceString() that actually recognize and process the header line string and strings of nucleotide or protein sequences in the character vector fastafile. We use the function constructor stringparser() from the package to construct helper functions that recognize and capture the desired matches, and we use match_s() to create parcr compliant parsers from these.

    Header <- function() {
      match_s(stringparser("^>(\\w+)")) %using%
        function(x) list(title = unlist(x))
    }

    NuclSequenceString <- function() {
      match_s(stringparser("^([GATC]+)$"))
    }

    ProtSequenceString <- function() {
      match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
    }

Now we have all the elements that we need to apply the Fasta() parser.
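
Before doing so, it can help to check the three regular expressions in isolation. The sketch below uses only base R and does not call stringparser() or match_s(); capture_first() is a hypothetical helper introduced for this illustration, and the test strings are shortened fragments of the example sequences.

```r
# Patterns copied from the parser definitions above
header_pattern <- "^>(\\w+)"
nucl_pattern   <- "^([GATC]+)$"
prot_pattern   <- "^([ARNDBCEQZGHILKMFPSTWYV]+)$"

# capture_first() is a hypothetical helper: it returns the first capture
# group of `pattern` in `line`, or NA_character_ when there is no match
capture_first <- function(pattern, line) {
  m <- regmatches(line, regexec(pattern, line))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

capture_first(header_pattern, ">sequence_A")  # captures the title
capture_first(nucl_pattern, "GGTAAGTCCT")     # matches: only G, A, T, C
capture_first(nucl_pattern, "MTEITAAMVK")     # NA: not a nucleotide string
capture_first(prot_pattern, "MTEITAAMVK")     # matches the protein pattern
capture_first(prot_pattern, "GGTAAGTCCT")     # also matches: see note below
```

The last call shows that the two sequence patterns overlap: G, A, T and C are also valid amino acid codes, so a string made up of only those letters satisfies both patterns. With the patterns checked, the full parser is applied next.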

    Fasta()(fastafile)
    #> $L
    #> $L[[1]]
    #> $L[[1]]$title
    #> [1] "sequence_A"
    #>
    #> $L[[1]]$type
    #> [1] "Nucl"
    #>
    #> $L[[1]]$sequence
    #> [1] "GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA"
    #>
    #>
    #> $L[[2]]
    #> $L[[2]]$title
    #> [1] "sequence_B"
    #>
    #> $L[[2]]$type
    #> [1] "Nucl"
    #>
    #> $L[[2]]$sequence
    #> [1] "ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC"
    #>
    #>
    #> $L[[3]]
    #> $L[[3]]$title
    #> [1] "sequence_C"
    #>
    #> $L[[3]]$type
    #> [1] "Prot"
    #>
    #> $L[[3]]$sequence
    #> [1] "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE"
    #>
    #>
    #>
    #> $R
    #> list()

The output of the parser consists of two elements, L and R, where L contains the parsed and processed part of the input and R the remaining un-parsed part of the input. Since we explicitly demanded to parse until the end of the file by the eof() function in the definition of the Fasta() parser, the R element contains an empty list to signal that the parser was indeed at the end of the input. Please see the package documentation for more examples and explanation.
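
In a script you may want to test programmatically whether any input was left over. The following is a minimal sketch that relies only on the L and R structure shown above. fully_parsed() is a hypothetical helper, not a parcr function, and both result objects are built by hand with a shortened sequence rather than produced by running the parser.

```r
# Hand-built stand-in with the same shape as the printed result above
# (sequence shortened for illustration; not produced by running the parser)
result <- list(
  L = list(list(title = "sequence_A", type = "Nucl", sequence = "GGTAAGTCCT")),
  R = list()
)

# fully_parsed() is a hypothetical helper: TRUE when the result has an L
# element and nothing is left in R
fully_parsed <- function(res) {
  is.list(res) && !is.null(res$L) && length(res$R) == 0
}

fully_parsed(result)

# A stand-in for a parse that stopped before the end of the input
leftover <- list(L = list("x"), R = c("unparsed line"))
fully_parsed(leftover)
```

This check only covers results with the L and R shape shown here. How a failed parse is represented is described in the package documentation and is not assumed in this sketch.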
Finally, let’s present the result of the parse more concisely using the names of the elements inside the L element:

    d <- Fasta()(fastafile)[["L"]]
    invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
    #> Nucl sequence_A GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA
    #> Nucl sequence_B ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC
    #> Prot sequence_C MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE
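
For further analysis it may be more convenient to collect the same elements in a data frame. The sketch below assumes a list shaped like `d` above, with the elements title, type and sequence. To keep it self-contained, `d` is built by hand here with shortened sequences; with the real parse result you would use `Fasta()(fastafile)[["L"]]` instead. The name fasta_df is a hypothetical choice.

```r
# Hand-built stand-in for `d` (same element names as above, sequences shortened)
d <- list(
  list(title = "sequence_A", type = "Nucl", sequence = "GGTAAGTCCT"),
  list(title = "sequence_B", type = "Nucl", sequence = "ATTGTGATAT"),
  list(title = "sequence_C", type = "Prot", sequence = "MTEITAAMVK")
)

# One row per sequence block; vapply() enforces one value of the stated
# type per element, so malformed entries raise an error instead of
# silently producing list columns
fasta_df <- data.frame(
  title    = vapply(d, function(x) x$title, character(1)),
  type     = vapply(d, function(x) x$type, character(1)),
  length   = vapply(d, function(x) nchar(x$sequence), integer(1)),
  sequence = vapply(d, function(x) x$sequence, character(1)),
  stringsAsFactors = FALSE
)
print(fasta_df)
```

The length column is an addition for illustration; it is computed from the sequence and is not part of the parser output.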