
Construct parser combinator functions for parsing character vectors

This R package contains tools to construct parser combinator functions, higher order functions that parse input. The main goal of this package is to simplify the creation of transparent parsers for structured text files generated by machines like laboratory instruments. Such files consist of lines of text organized in higher-order structures like headers with metadata and blocks of measured values. To read these data into R you first need to create a parser that processes these files and creates R-objects as output. The parcr package simplifies the task of creating such parsers.

This package was inspired by the package “Ramble” by Chapman Siu and co-workers and by the paper “Higher-order functions for parsing” by Graham Hutton (1992).

Installation

Install the stable version from CRAN

install.packages("parcr")

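To install the development version, including its vignette, run the following command. The install_github() function is not part of base R; it is provided by the remotes package (and also by devtools), which must be installed first.

remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)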

Example application: a parser for fasta sequence files

As an example of a realistic application we write a parser for fasta-formatted files for nucleotide and protein sequences. We use a few simplifying assumptions about this format for the sake of the example. Real fasta files are more complex than we pretend here.

Please note that more background about the functions that we use here is available in the package documentation. Here we only present a summary.

A fasta file with mixed sequence types could look like the example below:

>sequence_A
GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA
>sequence_B
ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC
>sequence_C
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE

Since fasta files are text files we could read such a file using readLines() into a character vector. The package provides the data set fastafile which contains that character vector.

data("fastafile")
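For a file on disk, the readLines() route mentioned above might look like the following sketch. It uses only base R; the temporary file and its shortened contents are made up for illustration and are not the package's fastafile data set.

```r
# Hypothetical example: write a tiny fasta-like file to a temporary location,
# then read it back as a character vector with one element per line.
path <- tempfile(fileext = ".fasta")
writeLines(c(">sequence_A", "GGTAAGTCC", "", ">sequence_B", "ATTGTG"), path)

fasta_lines <- readLines(path)

# Empty lines are kept as empty strings, so a parser has to deal with them.
print(fasta_lines)
print(length(fasta_lines))
```

A character vector like fasta_lines is the kind of input that the parsers constructed below operate on.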

We can distinguish the following higher order components in a fasta file:

A fasta file: consists of one or more sequence blocks until the end of the file.
A sequence block: consists of a header and a nucleotide sequence or a protein sequence. A sequence block could be preceded by zero or more empty lines.
A nucleotide sequence: consists of one or more nucleotide sequence strings.
A protein sequence: consists of one or more protein sequence strings.
A header is a string that starts with a “>” immediately followed by a title without spaces.
A nucleotide sequence string is a string without spaces that consists entirely of symbols from the set {G,A,T,C}.
A protein sequence string is a string without spaces that consists entirely of symbols from the set {A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}.

It now becomes clear what we mean when we say that the package allows us to write transparent parsers: the description above of the structure of fasta files can be put straight into code for a Fasta() parser:

Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then%
    Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste(x, collapse = ""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste(x, collapse = ""))
}

Functions like one_or_more(), %then%, %or%, %using%, eof() and MaybeEmpty() are defined in the package and are the basic parsers with which the package user can build complex parsers. The %using% operator applies the function on its right-hand side to the output of the parser on its left-hand side. Please see the vignette in the parcr package for more explanation of why this is useful, or even necessary.

Notice that the new parser functions that we define above are higher order functions taking no input, hence the empty argument brackets () behind their names.
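To make the idea of combining parsers concrete, here is a toy sketch in base R. It is not parcr's implementation: the names lit(), then_p(), or_p() and using_p() are invented for this illustration, and signalling failure with NULL is a simplifying assumption. The sketch only shows the general pattern of parsers that consume the front of a character vector and return a parsed part L and a remainder R.

```r
# Toy combinators in base R. These are NOT the parcr functions; all names
# here are hypothetical. A parser is a function of a character vector that
# returns list(L = parsed, R = rest) on success and NULL on failure.

# Parser that accepts exactly the string s as the first element.
lit <- function(s) function(x) {
  if (length(x) > 0 && identical(x[1], s)) list(L = list(s), R = x[-1]) else NULL
}

# Sequence: run p1, then run p2 on what p1 left over.
then_p <- function(p1, p2) function(x) {
  r1 <- p1(x)
  if (is.null(r1)) return(NULL)
  r2 <- p2(r1$R)
  if (is.null(r2)) return(NULL)
  list(L = c(r1$L, r2$L), R = r2$R)
}

# Alternative: use the result of p1 if it succeeds, otherwise try p2.
or_p <- function(p1, p2) function(x) {
  r1 <- p1(x)
  if (!is.null(r1)) r1 else p2(x)
}

# Post-process the parsed part L with f, leaving the remainder untouched.
using_p <- function(p, f) function(x) {
  r <- p(x)
  if (is.null(r)) return(NULL)
  list(L = f(r$L), R = r$R)
}

# A zero-argument constructor, in the same style as Fasta() above:
# calling AbOrC() returns the actual parser function.
AbOrC <- function() {
  using_p(
    or_p(then_p(lit("a"), lit("b")), lit("c")),
    function(l) paste(unlist(l), collapse = "+")
  )
}

res <- AbOrC()(c("a", "b", "z"))
print(res$L)  # "a+b"
print(res$R)  # "z"
```

The double call AbOrC()(input) has the same shape as Fasta()(fastafile): the first pair of brackets builds the parser, the second applies it to the input.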

Now we need to define the parsers Header(), NuclSequenceString() and ProtSequenceString() that actually recognize and process the header line string and strings of nucleotide or protein sequences in the character vector fastafile. We use the function constructor stringparser() from the package to construct helper functions that recognize and capture the desired matches, and we use match_s() to create parcr compliant parsers from these.

Header <- function() {
  match_s(stringparser("^>(\\w+)")) %using%
    function(x) list(title = unlist(x))
}

NuclSequenceString <- function() {
  match_s(stringparser("^([GATC]+)$"))
}

ProtSequenceString <- function() {
  match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
}
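The regular expressions used above can be tried out on their own with base R. The sketch below does not call stringparser() or match_s(); it only checks which lines each pattern accepts, and the sample lines are made up. It also shows that every nucleotide string matches the protein pattern as well, because G, A, T and C are all in the protein symbol set, which is presumably why NuclSequence() is tried before ProtSequence() in SequenceBlock().

```r
# Base R only: this mirrors the regular expressions, not the parcr helpers.
header_re <- "^>(\\w+)"
nucl_re   <- "^([GATC]+)$"
prot_re   <- "^([ARNDBCEQZGHILKMFPSTWYV]+)$"

# Hypothetical sample lines: a header, a nucleotide string, a protein string,
# and a string that contains a space and so matches neither sequence pattern.
lines <- c(">sequence_A", "GGTAAGTCC", "MTEITAAMVK", "GG TA")

is_nucl <- grepl(nucl_re, lines)
is_prot <- grepl(prot_re, lines)
print(data.frame(line = lines, is_nucl = is_nucl, is_prot = is_prot))

# Extract the captured title from the header line (the part in parentheses).
title <- regmatches(lines[1], regexec(header_re, lines[1]))[[1]][2]
print(title)  # "sequence_A"
```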

Now we have all the elements that we need to apply the Fasta() parser.

Fasta()(fastafile)
#> $L
#> $L[[1]]
#> $L[[1]]$title
#> [1] "sequence_A"
#>
#> $L[[1]]$type
#> [1] "Nucl"
#>
#> $L[[1]]$sequence
#> [1] "GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA"
#>
#>
#> $L[[2]]
#> $L[[2]]$title
#> [1] "sequence_B"
#>
#> $L[[2]]$type
#> [1] "Nucl"
#>
#> $L[[2]]$sequence
#> [1] "ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC"
#>
#>
#> $L[[3]]
#> $L[[3]]$title
#> [1] "sequence_C"
#>
#> $L[[3]]$type
#> [1] "Prot"
#>
#> $L[[3]]$sequence
#> [1] "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE"
#>
#>
#>
#> $R
#> list()

The output of the parser consists of two elements, L and R, where L contains the parsed and processed part of the input and R the remaining un-parsed part of the input. Since we explicitly demanded parsing until the end of the file, by using the eof() function in the definition of the Fasta() parser, the R element contains an empty list to signal that the parser was indeed at the end of the input. Please see the package documentation for more examples and explanation.

Finally, let’s present the result of the parse more concisely using the names of the elements inside the L element:

d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
#> Nucl sequence_A GGTAAGTCCTCTAGTACAAACACCCCCAATTCTGTTGCCAGAAAAAACACTTTTAGGCTA
#> Nucl sequence_B ATTGTGATATAATTAAAATTATATTCATATTATTAGAGCCATCTTCTTTGAAGCGTTGTCTATGCATCGATC
#> Prot sequence_C MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGENEYKALVAELEKE
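Because each element of the parsed result is a list with the names title, type and sequence, it could also be collected into a data frame for further analysis. The sketch below is one possible way to do that in base R. To stay self-contained it uses a hand-written stand-in for d with shortened sequences rather than the real parser output, and the length column is an addition made for this illustration.

```r
# Stand-in for Fasta()(fastafile)[["L"]]: same element names as the parser
# output shown above, but with shortened, made-up sequences.
d <- list(
  list(title = "sequence_A", type = "Nucl", sequence = "GGTAAGTCC"),
  list(title = "sequence_C", type = "Prot", sequence = "MTEITAAMVK")
)

# Turn each record into a one-row data frame and stack the rows.
df <- do.call(
  rbind,
  lapply(d, function(x) as.data.frame(x, stringsAsFactors = FALSE))
)

# Add the sequence length as an extra column.
df$length <- nchar(df$sequence)
print(df)
```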
