grep: Pattern Matching and Replacement

grep	R Documentation

Pattern Matching and Replacement

Description

grep, grepl, regexpr, gregexpr, regexec and gregexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.

sub and gsub perform replacement of the first and all matches respectively.

Usage

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)

regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

gregexec(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)

Arguments

`pattern`	character string containing a regular expression (or character string for `fixed = TRUE`) to be matched in the given character vector. Coerced by `as.character` to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for `regexpr`, `gregexpr` and `regexec`.
`x, text`	a character vector where matches are sought, or an object which can be coerced by `as.character` to a character vector. Long vectors are supported.
`ignore.case`	if `FALSE`, the pattern matching is case sensitive and if `TRUE`, case is ignored during matching.
`perl`	logical. Should Perl-compatible regexps be used?
`value`	if `FALSE`, a vector containing the (`integer`) indices of the matches determined by `grep` is returned, and if `TRUE`, a vector containing the matching elements themselves is returned.
`fixed`	logical. If `TRUE`, `pattern` is a string to be matched as is. Overrides all conflicting arguments.
`useBytes`	logical. If `TRUE` the matching is done byte-by-byte rather than character-by-character. See ‘Details’.
`invert`	logical. If `TRUE` return indices or values for elements that do not match.
`replacement`	a replacement for matched pattern in `sub` and `gsub`. Coerced to character if possible. For `fixed = FALSE` this can include backreferences `"\1"` to `"\9"` to parenthesized subexpressions of `pattern`. For `perl = TRUE` only, it can also contain `"\U"` or `"\L"` to convert the rest of the replacement to upper or lower case and `"\E"` to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If `NA`, all elements in the result corresponding to matches will be set to `NA`.

Details

Arguments which should be character strings or character vectors are coerced to character if possible.

Each of these functions operates in one of three modes:

fixed = TRUE: use exact matching.
perl = TRUE: use Perl-style regular expressions.
fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions (the default).

See the help pages on regular expression for details of the different types of regular expressions.
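As a minimal sketch of how the three modes treat the same input, the following compares them on a small made-up vector. The data and variable names are illustrative only.

```r
x <- c("a.b", "a+b", "axb")

# Default mode (fixed = FALSE, perl = FALSE): "." matches any single character
default_hits <- grepl("a.b", x)

# fixed = TRUE: "a.b" is matched literally, so "." only matches "."
fixed_hits <- grepl("a.b", x, fixed = TRUE)

# perl = TRUE: Perl-only syntax such as the lookahead "(?=...)" is accepted
perl_hits <- grepl("a(?=\\+)", x, perl = TRUE)

default_hits  # TRUE TRUE TRUE
fixed_hits    # TRUE FALSE FALSE
perl_hits     # FALSE TRUE FALSE
```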

The two *sub functions differ only in that sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences. If replacement contains backreferences which are not defined in pattern the result is undefined (but most often the backreference is taken to be "").
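A short illustration of the difference, using a made-up string, together with a replacement that uses backreferences to groups that are defined in the pattern:

```r
s <- "one fish two fish"

first_only <- sub("fish", "cat", s)    # "one cat two fish"
all_hits   <- gsub("fish", "cat", s)   # "one cat two cat"

# "\\1" and "\\2" refer to the first and second parenthesized groups
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")  # "world hello"
```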

For regexpr, gregexpr, regexec and gregexec it is an error for pattern to be NA, otherwise NA is permitted and gives an NA match.

Both grep and grepl take missing values in x as not matching a non-missing pattern.
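The handling of missing values described above can be seen on a small illustrative vector (the data are made up):

```r
x <- c("apple", NA, "banana")

idx  <- grep("an", x)      # 3: the NA element is not reported as a match
hits <- grepl("an", x)     # FALSE FALSE TRUE
pos  <- regexpr("an", x)   # start positions -1 NA 2: NA input gives an NA match
```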

The main effect of useBytes = TRUE is to avoid errors/warnings about invalid inputs and spurious matches in multibyte locales, but for regexpr it changes the interpretation of the output. It inhibits the conversion of inputs with marked encodings, and is forced if any input is found which is marked as "bytes" (see Encoding).

Caseless matching does not make much sense for bytes in a multibyte locale, and you should expect it only to work for ASCII characters if useBytes = TRUE.
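A small sketch of how useBytes = TRUE changes the interpretation of regexpr output. It assumes the input is a string marked as UTF-8 (as "\u" escapes produce), in which the accented letter occupies two bytes.

```r
x <- "caf\u00e9 au lait"   # "\u00e9" is one character but two bytes in UTF-8

char_pos <- as.integer(regexpr("au", x))                   # 6, counted in characters
byte_pos <- as.integer(regexpr("au", x, useBytes = TRUE))  # 7, counted in bytes
```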

regexpr and gregexpr with perl = TRUE allow Python-style named captures, but not for long vector inputs.

Invalid inputs in the current locale are warned about up to 5 times.

Caseless matching with perl = TRUE for non-ASCII characters depends on the PCRE library being compiled with ‘Unicode property support’, which PCRE2 is by default.

Value

grep(value = FALSE) returns a vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE). This will be an integer vector unless the input is a long vector, when it will be a double vector.

grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).

grepl returns a logical vector (match or not for each element of x).
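The return values of grep and grepl described above can be compared side by side. This is only a sketch on a made-up named vector.

```r
x <- c(a = "red", b = "green", c = "blue")

idx      <- grep("r", x)                               # indices 1 and 2
matched  <- grep("r", x, value = TRUE)                 # c(a = "red", b = "green")
rejected <- grep("r", x, value = TRUE, invert = TRUE)  # c(c = "blue")
flags    <- grepl("r", x)                              # TRUE TRUE FALSE
```

Indexing with the logical vector, as in x[flags], selects the same elements as value = TRUE.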

sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g., if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native.

regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes (as they are for ASCII-only matching: in either case an attribute useBytes with value TRUE is set on the result). If named capture is used there are further attributes "capture.start", "capture.length" and "capture.names".
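To make the structure of the regexpr result concrete, here is a minimal sketch on made-up ASCII input. It uses regmatches to extract the matched text.

```r
x <- c("abc123", "no digits", "x7")
m <- regexpr("[0-9]+", x)

starts  <- as.integer(m)                        # 4 -1 2
lengths <- as.integer(attr(m, "match.length"))  # 3 -1 1

# regmatches() drops the elements that had no match
found <- regmatches(x, m)                       # "123" "7"
```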

gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
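The same kind of made-up input passed to gregexpr gives one result per element of text, each holding every match:

```r
x <- c("a1b22c333", "none")
g <- gregexpr("[0-9]+", x)

starts1  <- as.integer(g[[1]])                        # 2 4 7
lengths1 <- as.integer(attr(g[[1]], "match.length"))  # 1 2 3
starts2  <- as.integer(g[[2]])                        # -1 (no match)

all_found <- regmatches(x, g)  # list(c("1", "22", "333"), character(0))
```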

regexec returns a list of the same length as text each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" a vector giving the lengths of the matches (or -1 for no match). The interpretation of positions and length and the attributes follows regexpr.
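A brief sketch of the regexec result for a pattern with two parenthesized subexpressions, using made-up input. The first position belongs to the whole match and the following ones to the groups.

```r
x <- c("key=value", "novalue")
m <- regexec("([a-z]+)=([a-z]+)", x)

starts1  <- as.integer(m[[1]])                        # 1 1 5
lengths1 <- as.integer(attr(m[[1]], "match.length"))  # 9 3 5
starts2  <- as.integer(m[[2]])                        # -1 (no match)

parts <- regmatches(x, m)[[1]]  # "key=value" "key" "value"
```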

gregexec returns the same as regexec, except that to accommodate multiple matches per element of text, the integer sequences for each match are made into columns of a matrix, with one matrix per element of text with matches.

Where matching failed because of resource limits (especially for perl = TRUE) this is regarded as a non-match, usually with a warning.

Warning

The POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Performance considerations

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally perl = TRUE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).

If you are working in a single-byte locale and have marked UTF-8 strings that are representable in that locale, convert them first as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around 3x for the default POSIX 1003.2 mode.

If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster. Often byte-based matching suffices in a UTF-8 locale since byte patterns of one character never match part of another. Character ranges may produce unexpected results.

PCRE-based matching by default used to put additional effort into ‘studying’ the compiled pattern when x/text has length 10 or more. That study may use the PCRE JIT compiler on platforms where it is available (see pcre_config). As from PCRE2 (PCRE version >= 10.00 as reported by extSoftVersion), there is no study phase, but the patterns are optimized automatically when possible, and PCRE JIT is used when enabled. The details are controlled by options PCRE_study and PCRE_use_JIT.

(Some timing comparisons can be seen by running file ‘tests/PCRE.R’ in the R sources (and perhaps installed).)

People working with PCRE and very long strings can adjust the maximum size of the JIT stack by setting environment variable R_PCRE_JIT_STACK_MAXSIZE before JIT is used to a value between 1 and 1000 in MB: the default is 64. When JIT is not used with PCRE version < 10.30 (that is with PCRE1 and old versions of PCRE2), it might also be wise to set the option PCRE_limit_recursion.
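As a rough sketch of how these settings might be inspected and changed from a session: the pattern and input below are made up, and whether turning JIT off helps or hurts will depend on the platform and PCRE build.

```r
# Which PCRE library is in use, and what it was built with
pcre_version <- extSoftVersion()[["PCRE"]]
cfg <- pcre_config()               # named logical vector of PCRE capabilities

# Temporarily turn the JIT off for perl = TRUE matching, then restore
old <- options(PCRE_use_JIT = FALSE)
res <- grepl("^a+$", "aaaa", perl = TRUE)
options(old)

# R_PCRE_JIT_STACK_MAXSIZE is read before JIT is first used, so it would
# normally be set in the environment before R starts rather than here.
```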

Note

Aspects will be platform-dependent as well as locale-dependent: for example the implementation of character classes (except [:digit:] and [:xdigit:]). One can expect results to be consistent for ASCII inputs and when working in UTF-8 mode (when most platforms will use Unicode character tables, although those are updated frequently and subject to some degree of interpretation – is a circled capital letter alphabetic or a symbol?). However, results in 8-bit encodings can differ considerably between platforms, modes and from the UTF-8 versions.

Source

The C code for POSIX-style regular expression matching has changed over the years. As from R 2.10.0 (Oct 2009) the TRE library of Ville Laurikari (https://github.com/laurikari/tre) is used. The POSIX standard does give some room for interpretation, especially in the handling of invalid regular expressions and the collation of character ranges, so the results will have changed slightly over the years.

For Perl-style matching PCRE2 or PCRE (https://www.pcre.org) is used: again the results may depend (slightly) on the version of PCRE in use.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole (grep)

Examples

grep("[a-z]", letters)

txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo", txt)))
   cat("'foo' appears at least once in\n\t", txt, "\n")
i # 2 and 4
txt[i]

## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")

txt <- c("The", "licenses", "for", "most", "software", "are",
  "designed", "to", "take", "away", "your", "freedom",
  "to", "share", "and", "change", "it.",
  "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
  "is", "intended", "to", "guarantee", "your", "freedom", "to",
  "share", "and", "change", "free", "software", "--",
  "to", "make", "sure", "the", "software", "is",
  "free", "for", "all", "its", "users")
( i <- grep("[gu]", txt) ) # indices
stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

## Note that for some implementations character ranges are
## locale-dependent (but not currently).  Then [b-e] in locales such as
## en_US may include B as the collation order is aAbBcCdDe ...
(ot <- sub("[b-e]",".", txt))
txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution
## In caseless matching, ranges include both cases:
a <- grep("[b-e]", txt, value = TRUE)
b <- grep("[b-e]", txt, ignore.case = TRUE, value = TRUE)
setdiff(b, a)

txt[gsub("g","#", txt) !=
    gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

regexpr("en", txt)

gregexpr("e", txt)

## Using grepl() for filtering
## Find functions with argument names matching "warn":
findArgs <- function(env, pattern) {
  nms <- ls(envir = as.environment(env))
  nms <- nms[is.na(match(nms, c("F","T")))] # <-- work around "checking hack"
  aa <- sapply(nms, function(.) { o <- get(.)
               if(is.function(o)) names(formals(o)) })
  iw <- sapply(aa, function(a) any(grepl(pattern, a, ignore.case=TRUE)))
  aa[iw]
}
findArgs("package:base", "warn")

## trim trailing white space
str <- "Now is the time      "
sub(" +$", "", str)  ## spaces only
## what is considered 'white space' depends on the locale.
sub("[[:space:]]+$", "", str) ## white space, POSIX-style
## what PCRE considered white space changed in version 8.34: see ?regex
sub("\\s+$", "", str, perl = TRUE) ## PCRE-style white space

## capitalizing
txt <- "a test of capitalizing"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE)
gsub("\\b(\\w)",    "\\U\\1",       txt, perl=TRUE)

txt2 <- "useRs may fly into JFK or laGuardia"
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)
 sub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)

## named capture
notables <- c("  Ben Franklin and Jefferson Davis",
              "\tMillard Fillmore")
# name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
parse.one(notables, parsed)

## Decompose a URL into its components.
## Example by LT (http://www.cs.uiowa.edu/~luke/R/regexp.html).
x <- "http://stat.umn.edu:80/xyz"
m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
m
regmatches(x, m)
## Element 3 is the protocol, 4 is the host, 6 is the port, and 7
## is the path.  We can use this to make a function for extracting the
## parts of a URL:
URL_parts <- function(x) {
    m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
    parts <- do.call(rbind,
                     lapply(regmatches(x, m), `[`, c(3L, 4L, 6L, 7L)))
    colnames(parts) <- c("protocol","host","port","path")
    parts
}
URL_parts(x)

## gregexec() may match multiple times within a single string.
pattern <- "([[:alpha:]]+)([[:digit:]]+)"
s <- "Test: A1 BC23 DEF456"
m <- gregexec(pattern, s)
m
regmatches(s, m)

## Before gregexec() was implemented, one could emulate it by running
## regexec() on the regmatches obtained via gregexpr().  E.g.:
lapply(regmatches(s, gregexpr(pattern, s)),
       function(e) regmatches(e, regexec(pattern, e)))
