utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.
utf8 is available on CRAN. To install the latest released version, run the following command in R:

    install.packages("utf8")

To install the latest development version, run the following:

    devtools::install_github("patperry/r-utf8")

Load the package before running the examples:

    library(utf8)

Use as_utf8() to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

    # second entry is encoded in latin-1, but declared as UTF-8
    x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
    Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
    as_utf8(x) # fails
    #> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4

    # mark the correct encoding
    Encoding(x[2]) <- "latin1"
    as_utf8(x) # succeeds
    #> [1] "façile" "façile" "façile"

Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally, apply compatibility maps for NFKC normal form or perform case folding.

    # three ways to encode an angstrom character
    (angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
    #> [1] "Å" "Å" "Å"
    utf8_normalize(angstrom) == "\u00c5"
    #> [1] TRUE TRUE TRUE

    # perform full Unicode case-folding
    utf8_normalize("Größe", map_case = TRUE)
    #> [1] "grösse"

    # apply compatibility maps to NFKC normal form
    # (example from https://twitter.com/aprilarcus/status/367557195186970624)
    utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.", map_compat = TRUE)
    #> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."

On some platforms (including macOS), the R implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:

    print(intToUtf8(0x1F600 + 0:79)) # with default R print function
    #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

    utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
    #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣…"

    utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
    #> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

Cite utf8 with the following BibTeX entry:

    @Manual{,
      title = {utf8: Unicode Text Processing},
      author = {Patrick O. Perry},
      note = {R package version 1.2.4.9900, https://github.com/patperry/r-utf8},
      url = {https://ptrckprry.com/r-utf8/},
    }

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either
fork the repository and submit a pull request
or contact the maintainer via e-mail.
This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.