Encoding {base}

R Documentation

Read or Set the Declared Encodings for a Character Vector

Description

Read or set the declared encodings for a character vector.

Usage

Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)

Arguments

x

A character vector.

value

A character vector of positive length.

Details

Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes". These declarations can be read by Encoding, which will return a character vector of values "latin1", "UTF-8", "bytes" or "unknown", or set, in which case value is recycled as needed and other values are silently treated as "unknown". ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. Strings marked as "bytes" are intended to be non-ASCII strings which should be manipulated as bytes, and never converted to a character encoding (so writing them to a text file is supported only by writeLines(useBytes = TRUE)).
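The following is a minimal sketch of reading and setting declarations as described above. The variable names (x, a, v, b, tmp) are illustrative only, and the literal byte escapes assume the session's parser accepts them, as in the Examples section.

```r
# A Latin-1 byte string: 0xE7 is "c with cedilla" in ISO 8859-1.
x <- "fran\xE7ais"
Encoding(x) <- "latin1"
Encoding(x)                        # "latin1"

# ASCII strings are never marked, whatever is requested.
a <- "abc"
Encoding(a) <- "UTF-8"
Encoding(a)                        # "unknown"

# 'value' is recycled; unrecognised values are treated as "unknown".
v <- c("fran\xE7ais", "gar\xE7on")
Encoding(v) <- "latin1"
Encoding(v)                        # "latin1" "latin1"
Encoding(v) <- "no-such-encoding"
Encoding(v)                        # "unknown" "unknown"

# A string to be handled as raw bytes (here, 9 bytes of UTF-8).
b <- "fran\xc3\xa7ais"
Encoding(b) <- "bytes"
Encoding(b)                        # "bytes"

# Writing a "bytes" string to a file requires useBytes = TRUE.
tmp <- tempfile()
writeLines(b, tmp, useBytes = TRUE)
readBin(tmp, "raw", 9)             # the same 9 bytes as charToRaw(b)
```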

enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.
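As a small sketch of the conversion, the code below marks a string as Latin-1 and re-encodes it with enc2utf8, comparing the underlying bytes. The names x, u and n are illustrative; the result of enc2native depends on the locale of the running session, so no output is shown for it.

```r
x <- "fran\xE7ais"
Encoding(x) <- "latin1"

u <- enc2utf8(x)
Encoding(u)          # "UTF-8"
charToRaw(x)         # 66 72 61 6e e7 61 69 73      (one byte for 0xE7)
charToRaw(u)         # 66 72 61 6e c3 a7 61 69 73   (two bytes, c3 a7)

# ASCII input is returned unmarked.
Encoding(enc2utf8("abc"))   # "unknown"

# Conversion to the native encoding; the bytes and the mark depend on the locale.
n <- enc2native(x)
```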

There are other ways for character strings to acquire a declared encoding apart from explicitly setting it (and these have changed as R has evolved). The parser marks strings containing ‘\u’ or ‘\U’ escapes. Functions scan, read.table, readLines, and parse have an encoding argument that is used to declare encodings, iconv declares encodings from its to argument, and console input in suitable locales is also declared. intToUtf8 declares its output as "UTF-8", and output text connections (see textConnection) are marked if running in a suitable locale. Under some circumstances (see its help page) source(encoding=) will mark encodings of character strings it outputs.
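A few of these routes can be sketched as follows. The variable names and the temporary file are illustrative, and only routes that do not depend on console input or the session locale are shown.

```r
# The parser marks strings that contain \u or \U escapes.
p <- "fa\u00e7ile"
Encoding(p)                          # "UTF-8"

# intToUtf8 declares its output as UTF-8 (231 is U+00E7).
i <- intToUtf8(c(102, 97, 231))
Encoding(i)                          # "UTF-8"

# iconv declares the encoding named in its 'to' argument.
x  <- "fran\xE7ais"
xx <- iconv(x, "latin1", "UTF-8")
Encoding(xx)                         # "UTF-8"

# readLines(encoding = ) declares the encoding of what it reads.
tmp <- tempfile()
writeBin(as.raw(c(0x66, 0x72, 0x61, 0x6e, 0xe7, 0x61, 0x69, 0x73, 0x0a)), tmp)
r <- readLines(tmp, encoding = "latin1")
Encoding(r)                          # "latin1"
```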

Most character manipulation functions will set the encoding on output strings if it was declared on the corresponding input. These include chartr, strsplit(useBytes = FALSE), tolower and toupper as well as sub(useBytes = FALSE) and gsub(useBytes = FALSE). Note that such functions do not preserve the encoding, but if they know the input encoding and that the string has been successfully re-encoded (to the current encoding or UTF-8), they mark the output.

substr does preserve the encoding, and chartr, tolower and toupper preserve UTF-8 encoding on systems with Unicode wide characters. With their fixed and perl options, strsplit, sub and gsub will give a marked UTF-8 result if any of the inputs are UTF-8.
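The sketch below illustrates the behaviour for a string marked as UTF-8; the variable names are illustrative. The mark on the result of toupper can depend on the platform and locale, so no output is shown for it.

```r
xx <- "fran\u00e7ais"          # marked as UTF-8 by the parser
Encoding(xx)                   # "UTF-8"

# substr preserves the declared encoding of non-ASCII results ...
s5 <- substr(xx, 5, 5)
Encoding(s5)                   # "UTF-8"
# ... while an all-ASCII substring is never marked.
s14 <- substr(xx, 1, 4)
Encoding(s14)                  # "unknown"

# With fixed = TRUE, a UTF-8 input gives a result marked as UTF-8.
f <- sub("a", "A", xx, fixed = TRUE)
Encoding(f)                    # "UTF-8"

# Case conversion: inspect the mark rather than assume it.
Encoding(toupper(xx))
```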

paste and sprintf return elements marked as bytes if any of the corresponding inputs is marked as bytes, and otherwise marked as UTF-8 if any of the inputs is marked as UTF-8.
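A short sketch of this precedence rule (bytes first, then UTF-8), using illustrative variable names:

```r
xl <- "fran\xE7ais"; Encoding(xl) <- "latin1"
xu <- "gar\u00e7on"                            # marked as UTF-8
xb <- "\xc3\xa7";    Encoding(xb) <- "bytes"

p1 <- paste(xl, xu)    # one input is UTF-8, none is bytes
Encoding(p1)           # "UTF-8"

p2 <- paste(xu, xb)    # one input is bytes: bytes wins
Encoding(p2)           # "bytes"

p3 <- paste("a", "b")  # all ASCII: never marked
Encoding(p3)           # "unknown"

s1 <- sprintf("%s!", xu)
Encoding(s1)           # "UTF-8"
```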

match, pmatch, charmatch, duplicated and unique all match in UTF-8 if any of the elements are marked as UTF-8.
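To illustrate, the sketch below compares the same word stored in two encodings; the byte sequences differ, but matching is done after translation to UTF-8. The variable names are illustrative.

```r
xl <- "fran\xE7ais"; Encoding(xl) <- "latin1"
xu <- iconv(xl, "latin1", "UTF-8")

# The stored bytes differ (8 bytes versus 9) ...
identical(charToRaw(xl), charToRaw(xu))   # FALSE

# ... but the strings match, because one of them is marked as UTF-8.
match(xl, xu)                             # 1
duplicated(c(xl, xu))                     # FALSE  TRUE
length(unique(c(xl, xu)))                 # 1
```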

Changing the current encoding from a running R session may lead to confusion (see Sys.setlocale).

There is some ambiguity as to what is meant by a ‘Latin-1’ locale, since some OSes (notably Windows) make use of character positions undefined (or used for control characters) in the ISO 8859-1 character set. How such characters are interpreted is system-dependent, but as from R 3.5.0 they are, if possible, interpreted as per Windows codepage 1252 (which Microsoft calls ‘Windows Latin 1 (ANSI)’) when converting to e.g. UTF-8.

Value

A character vector.

For enc2utf8, encodings are always marked; for enc2native, they are marked in UTF-8 and Latin-1 locales.

Examples

## x is intended to be in latin1
x. <- x <- "fran\xE7ais"
Encoding(x.) # "unknown" (UTF-8 loc.) | "latin1" (8859-1/CP-1252 loc.) | ....
Encoding(x) <- "latin1"
x
xx <- iconv(x, "latin1", "UTF-8")
Encoding(c(x., x, xx))
c(x, xx)
xb <- xx; Encoding(xb) <- "bytes"
xb # will be encoded in hex
cat("x = ", x, ", xx = ", xx, ", xb = ", xb, "\n", sep = "")
(Ex <- Encoding(c(x.,x,xx,xb)))
stopifnot(identical(Ex, c(Encoding(x.), Encoding(x),
                          Encoding(xx), Encoding(xb))))

[Package base version 4.6.0 Index]