Introduction

This package collects some of the functions I have used when working with data.

library(xutils)

Text-related Functions

html_decode

Currently, there is only one function: html_decode, which replaces HTML entities such as &amp;amp; with their original form, &amp;.

This function is a thin wrapper around C++ code provided by Christoph on Stack Overflow.

Example

For example:

strings <- c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;")
html_decode(strings)
#> [1] "abcd"  "& ' >" "&"     "€ <"

It works very well!
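To make the transformation concrete, here is a rough base-R sketch of entity decoding. It is not how xutils implements html_decode (that is the wrapped C++ code); naive_html_decode is a hypothetical helper written only for illustration, and it handles just a handful of named entities.

```r
# Hypothetical helper, not part of xutils: decodes only a few named entities.
naive_html_decode <- function(x) {
  # "&amp;" is replaced last so that "&amp;lt;" decodes to "&lt;", not "<".
  entities <- c(
    "&lt;"   = "<",
    "&gt;"   = ">",
    "&apos;" = "'",
    "&quot;" = "\"",
    "&euro;" = "\u20ac",
    "&amp;"  = "&"
  )
  for (entity in names(entities)) {
    # fixed = TRUE treats the entity as a literal string, not a regex.
    x <- gsub(entity, entities[[entity]], x, fixed = TRUE)
  }
  x
}

naive_html_decode(c("abcd", "&amp; &apos; &gt;", "&amp;", "&euro; &lt;"))
```

A lookup loop like this makes one pass over the input per entity and ignores numeric references such as &#38;, which is part of why a dedicated decoder is worth wrapping.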

Comparison with Existing Solutions

To the best of my knowledge, there are already several solutions to this problem. So why wrap up a new function? Because of performance.

First, there is an existing package, textutils, that contains many functions for dealing with data. The one of interest here is HTMLdecode.

Second, there is a function by SO user Stibu (here) that uses the xml2 package:

unescape_html2 <- function(str) {
  html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
  parsed <- xml2::xml_text(xml2::read_html(html))
  strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}

Third, I took the code from Christoph (here) and wrote an R wrapper for the C function. This function is xutils::html_decode.

Now, let's test the performance using the bench package.

bench::mark(
  html_decode(strings),
  unescape_html2(strings),
  textutils::HTMLdecode(strings)
)
#> # A tibble: 3 x 6
#>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 html_decode(strings)             6.28µs   7.03µs   137740.    2.49KB     13.8
#> 2 unescape_html2(strings)        101.68µs 106.37µs     9256.  445.59KB     18.9
#> 3 textutils::HTMLdecode(strings)   4.33ms   4.47ms      223.  379.19KB     35.1

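Clearly, html_decode is the fastest of the three on this input: by median time it is roughly 15 times faster than unescape_html2 and more than 600 times faster than textutils::HTMLdecode.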

Note:

When testing the results, I discovered a bug in textutils::HTMLdecode and reported it here. The maintainer fixed it immediately. As of this writing (Feb. 16, 2021), the development version of textutils has the fix, but the CRAN version may not. If you run the benchmark yourself with an earlier version of textutils, you may run into an error; installing the development version will resolve it.

