| Encoding {base} | R Documentation |
Read or Set the Declared Encodings for a Character Vector
Description
Read or set the declared encodings for a character vector.
Usage
Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)
Arguments
x | A character vector. |
value | A character vector of positive length. |
Details
Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes". These declarations can be read by Encoding, which will return a character vector of values "latin1", "UTF-8", "bytes" or "unknown". They can also be set, in which case value is recycled as needed and other values are silently treated as "unknown". ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. Strings marked as "bytes" are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding (so writing them to a text file is supported only by writeLines(useBytes = TRUE)).
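A minimal sketch of reading and setting declarations, using made-up sample strings (the variable names and the value "no-such-encoding" are illustrative only). It shows recycling of value, the ASCII rule, and the silent fallback to "unknown":

```r
# Hypothetical sample data: one ASCII string and one containing the
# Latin-1 byte 0xE7 (c-cedilla).
x <- c("abc", "fran\xE7ais")

# A single value is recycled across the whole vector.
Encoding(x) <- "latin1"
enc_latin1 <- Encoding(x)
print(enc_latin1)   # "unknown" "latin1": the ASCII element is never marked

# Values other than "latin1", "UTF-8" and "bytes" are treated as "unknown".
y <- x
Encoding(y) <- "no-such-encoding"
enc_other <- Encoding(y)
print(enc_other)    # "unknown" "unknown"

# A non-ASCII string can instead be declared as raw bytes.
z <- "fran\xE7ais"
Encoding(z) <- "bytes"
enc_bytes <- Encoding(z)
print(enc_bytes)    # "bytes"
```

Setting the declaration only changes the mark: the underlying bytes of the string are left as they were.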
enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.
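As an illustration, the sketch below converts a string declared as "latin1" with enc2utf8 and inspects the bytes before and after. The sample string is an assumption. The result of enc2native depends on the session's locale, so no output is claimed for it:

```r
# Hypothetical input: a string whose bytes are Latin-1, declared as such.
x <- "fran\xE7ais"
Encoding(x) <- "latin1"

# enc2utf8 re-encodes using the declared encoding and marks the result.
u <- enc2utf8(x)
print(Encoding(u))    # "UTF-8"
print(charToRaw(x))   # 66 72 61 6e e7 61 69 73
print(charToRaw(u))   # 66 72 61 6e c3 a7 61 69 73

# enc2native converts to the native encoding; the outcome is
# locale-dependent, so it is only printed here.
n <- enc2native(x)
print(Encoding(n))
```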
There are other ways for character strings to acquire a declared encoding apart from explicitly setting it (and these have changed as R has evolved). The parser marks strings containing ‘\u’ or ‘\U’ escapes. Functions scan, read.table, readLines, and parse have an encoding argument that is used to declare encodings, iconv declares encodings from its to argument, and console input in suitable locales is also declared. intToUtf8 declares its output as "UTF-8", and output text connections (see textConnection) are marked if running in a suitable locale. Under some circumstances (see its help page) source(encoding=) will mark encodings of character strings it outputs.
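Three of these routes can be compared side by side. This is a sketch with illustrative variable names, covering only the parser, intToUtf8 and iconv; the file-reading and console routes are not shown:

```r
# 1. A \u escape: the parser marks the resulting string.
from_parser <- "fran\u00e7ais"

# 2. intToUtf8: code points for f, r, a, n, c-cedilla, a, i, s.
from_ints <- intToUtf8(c(102L, 114L, 97L, 110L, 231L, 97L, 105L, 115L))

# 3. iconv: the declaration is taken from the 'to' argument.
l <- "fran\xE7ais"
Encoding(l) <- "latin1"
from_iconv <- iconv(l, "latin1", "UTF-8")

marks <- Encoding(c(from_parser, from_ints, from_iconv))
print(marks)   # "UTF-8" "UTF-8" "UTF-8"

# All three routes yield the same bytes.
same_bytes <- identical(charToRaw(from_parser), charToRaw(from_ints)) &&
  identical(charToRaw(from_parser), charToRaw(from_iconv))
print(same_bytes)   # TRUE
```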
Most character manipulation functions will set the encoding on output strings if it was declared on the corresponding input. These include chartr, strsplit(useBytes = FALSE), tolower and toupper as well as sub(useBytes = FALSE) and gsub(useBytes = FALSE). Note that such functions do not preserve the encoding, but if they know the input encoding and that the string has been successfully re-encoded (to the current encoding or UTF-8), they mark the output.
substr does preserve the encoding, and chartr, tolower and toupper preserve UTF-8 encoding on systems with Unicode wide characters. With their fixed and perl options, strsplit, sub and gsub will give a marked UTF-8 result if any of the inputs are UTF-8.
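A small sketch of the substr behaviour, with an assumed sample string. An extracted piece that is pure ASCII comes back unmarked, in line with the rule that ASCII strings never carry a declaration. The toupper result is printed without a claimed output because it depends on the platform:

```r
u <- "fran\u00e7ais"          # marked "UTF-8" by the parser

mid    <- substr(u, 4, 6)     # n, c-cedilla, a
first3 <- substr(u, 1, 3)     # "fra", pure ASCII

print(Encoding(mid))          # "UTF-8": the mark is preserved
print(Encoding(first3))       # "unknown": ASCII is never marked
print(charToRaw(mid))         # 6e c3 a7 61

# Platform-dependent (see the text above), so only printed.
print(Encoding(toupper(u)))
```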
paste and sprintf return elements marked as bytes if any of the corresponding inputs is marked as bytes, and otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.
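The precedence of "bytes" over "UTF-8" can be checked with a short sketch (sample strings and variable names are assumptions):

```r
u <- "fran\u00e7ais"                  # marked "UTF-8"
b <- "fran\xE7ais"
Encoding(b) <- "bytes"                # marked "bytes"

p_utf8  <- paste(u, "text")           # one UTF-8 input, one ASCII input
p_bytes <- paste(b, u)                # a bytes input takes precedence
s_utf8  <- sprintf("%s!", u)

print(Encoding(p_utf8))               # "UTF-8"
print(Encoding(p_bytes))              # "bytes"
print(Encoding(s_utf8))               # "UTF-8"
```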
match, pmatch, charmatch, duplicated and unique all match in UTF-8 if any of the elements are marked as UTF-8.
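One consequence is that the same text stored under two different declared encodings is treated as equal by these functions, even though the bytes differ. A sketch with assumed sample strings:

```r
l <- "fran\xE7ais"
Encoding(l) <- "latin1"               # Latin-1 bytes, declared as such
u <- "fran\u00e7ais"                  # the same text, marked "UTF-8"

bytes_equal <- identical(charToRaw(l), charToRaw(u))
print(bytes_equal)                    # FALSE: different byte sequences

m   <- match(l, u)
dup <- duplicated(c(l, u))
uq  <- unique(c(l, u))

print(m)                              # 1
print(dup)                            # FALSE TRUE
print(length(uq))                     # 1
```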
Changing the current encoding from a running R session may lead to confusion (see Sys.setlocale).
There is some ambiguity as to what is meant by a ‘Latin-1’ locale, since some OSes (notably Windows) make use of character positions undefined (or used for control characters) in the ISO 8859-1 character set. How such characters are interpreted is system-dependent but as from R 3.5.0 they are if possible interpreted as per Windows codepage 1252 (which Microsoft calls ‘Windows Latin 1 (ANSI)’) when converting to e.g. UTF-8.
Value
A character vector.
For enc2utf8, encodings are always marked; for enc2native, they are marked in UTF-8 and Latin-1 locales.
Examples
## x is intended to be in latin1
x. <- x <- "fran\xE7ais"
Encoding(x.) # "unknown" (UTF-8 loc.) | "latin1" (8859-1/CP-1252 loc.) | ....
Encoding(x) <- "latin1"
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x., x, xx))
c(x, xx)
xb <- xx; Encoding(xb) <- "bytes"
xb # will be encoded in hex
cat("x = ", x, ", xx = ", xx, ", xb = ", xb, "\n", sep = "")
(Ex <- Encoding(c(x.,x,xx,xb)))
stopifnot(identical(Ex, c(Encoding(x.), Encoding(x), Encoding(xx), Encoding(xb))))