Movatterモバイル変換


[0]ホーム

URL:


utf8

rccCoverage StatusCRAN StatusLicenseCRAN RStudio Mirror Downloads

utf8 is an R package for manipulating and printing UTF-8text that fixes multiple bugs in R’s UTF-8 handling.

Installation

Stable version

utf8 isavailable on CRAN. To install the latest releasedversion, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

devtools::install_github("patperry/r-utf8")

Usage

library(utf8)

Validate characterdata and convert to UTF-8

Useas_utf8() to validate input text and convert toUTF-8 encoding. The function alerts you if the input text has the wrongdeclared encoding:

# second entry is encoded in latin-1, but declared as UTF-8x<-c("fa\u00E7ile","fa\xE7ile","fa\xC3\xA7ile")Encoding(x)<-c("UTF-8","UTF-8","bytes")as_utf8(x)# fails#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0xdeadbeef) at position 4# mark the correct encodingEncoding(x[2])<-"latin1"as_utf8(x)# succeeds#> [1] "façile" "façile" "façile"

Normalize data

Useutf8_normalize() to convert to Unicode composednormal form (NFC). Optionally apply compatibility maps for NFKC normalform or case-fold.

# three ways to encode an angstrom character(angstrom<-c("\u00c5","\u0041\u030a","\u212b"))#> [1] "Å" "Å" "Å"utf8_normalize(angstrom)=="\u00c5"#> [1] TRUE TRUE TRUE# perform full Unicode case-foldingutf8_normalize("Größe",map_case =TRUE)#> [1] "grösse"# apply compatibility maps to NFKC normal form# (example from https://twitter.com/aprilarcus/status/367557195186970624)utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",map_compat =TRUE)#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."

Print emoji

On some platforms (including MacOS), the R implementation ofprint() uses an outdated version of the Unicode standard todetermine which characters are printable. Useutf8_print()for an updated print function:

print(intToUtf8(0xdeadbeefF600+0:79))# with default R print function#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"utf8_print(intToUtf8(0xdeadbeefF600+0:79))# with utf8_print, truncates line#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​…"utf8_print(intToUtf8(0xdeadbeefF600+0:79),chars =1000)# higher character limit#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"

Citation

Citeutf8 with the following BibTeX entry:

@Manual{,  title = {utf8: Unicode Text Processing},  author = {Patrick O. Perry},  note = {R package version 1.2.4.9900, https://github.com/patperry/r-utf8},  url = {https://ptrckprry.com/r-utf8/},}

Contributing

The project maintainer welcomes contributions in the form of featurerequests, bug reports, comments, unit tests, vignettes, or other code.If you’d like to contribute, either

This project is released with aContributor Code of Conduct, andif you choose to contribute, you must adhere to its terms.


[8]ページ先頭

©2009-2025 Movatter.jp