stringfish

stringfish is a framework for performing string and sequence operations using the ALTREP system to speed up the computation of common string operations.

The ultimate goal of the package is to unify ALTREP string implementations under a common framework.

The ALTREP system (new as of R 3.5.0) allows package developers to represent R objects using their own custom memory layout, completely invisible to the user. stringfish represents string data as a simple C++/STL vector, which is very fast and lightweight.
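
As a rough mental model of that layout, the sketch below stores each element as a std::string plus an encoding tag in an ordinary std::vector. Every name here (ToyEncoding, ToyString, ToyStringVector, count_non_na) is hypothetical and chosen for illustration only; the package's actual types are declared in sf_external.h and may be organized differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; stringfish's real types
// (sfstring, sf_vec_data) live in sf_external.h and may differ.
enum class ToyEncoding { Native, UTF8, Latin1, Bytes, NA };

struct ToyString {
    std::string data;  // the bytes of the string
    ToyEncoding enc;   // encoding tag; NA marks a missing value
};

// The whole character vector is just a contiguous STL container.
using ToyStringVector = std::vector<ToyString>;

// Count non-missing elements with an ordinary loop over the vector.
std::size_t count_non_na(const ToyStringVector& v) {
    std::size_t n = 0;
    for (const ToyString& s : v) {
        if (s.enc != ToyEncoding::NA) ++n;
    }
    return n;
}
```

Because the container is a plain vector, C++ code can read and write elements directly without creating one R string object per element.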

Using normal R functions to process string data (e.g. substr, gsub, paste, etc.) causes “materialization” of ALTREP vectors to normal R data, which can be a slow process. Therefore, in order to take full advantage of the ALTREP framework, string processing functions need to be re-written to be ALTREP aware. This package hopes to fulfill that purpose.

Installation

install.packages("stringfish",type="source",configure.args="--with-simd=AVX2")

Benchmark

The simplest way to show the utility of the ALTREP framework is through a quick benchmark comparing stringfish and base R.

Yes, you are reading the graph correctly: some functions in stringfish are more than an order of magnitude faster than vectorized base R operations (and even faster with some built-in multithreading). On large text datasets, this can turn minutes of computation into seconds.

Currently implemented functions

A list of implemented stringfish functions and analogous base R functions:

Utility functions:

In addition, many R operations in base R and other packages are already ALTREP-aware (i.e. they don’t cause materialization). Functions that subset or index into string vectors generally do not materialize.

stringfish functions are not intended to exactly replicate their base R analogues. One difference is that the subject parameter is always the first argument, which is easier to use with pipes (%>%). E.g., gsub(pattern, replacement, subject) becomes sf_gsub(subject, pattern, replacement).

Extensibility

stringfish as a framework is intended to be easily extensible. Stringfish vectors can be worked into Rcpp scripts or even into other packages (see the qs2 package for an example).

Below is a detailed Rcpp script that creates a function to alternate upper and lower case of strings.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(stringfish)]]
#include <Rcpp.h>
#include "sf_external.h"
using namespace Rcpp;

// [[Rcpp::export]]
SEXP sf_alternate_case(SEXP x) {
  // Iterate through a character vector using the RStringIndexer class
  // If the input vector x is a stringfish character vector it will do so without materialization
  RStringIndexer r(x);
  size_t len = r.size();

  // Create an output stringfish vector
  // Like all R objects, it must be protected from garbage collection
  SEXP output = PROTECT(sf_vector(len));

  // Obtain a reference to the underlying output data
  sf_vec_data& output_data = sf_vec_data_ref(output);

  // You can use range based for loop via an iterator class that returns RStringIndexer::rstring_info e
  // rstring info is a struct containing const char * ptr (null terminated), int len, and cetype_t enc
  // a NA string is represented by a nullptr
  // Alternatively, access the data via the function r.getCharLenCE(i)
  size_t i = 0;
  for (auto e : r) {
    // check if string is NA and go to next if it is
    if (e.ptr == nullptr) {
      i++; // increment output index
      continue;
    }
    // create a temporary output string and process the results
    std::string temp(e.len, '\0');
    bool case_switch = false;
    for (int j = 0; j < e.len; j++) {
      if ((e.ptr[j] >= 65) && (e.ptr[j] <= 90)) { // char j is upper case
        if ((case_switch = !case_switch)) { // check if we should convert to lower case
          temp[j] = e.ptr[j] + 32;
          continue;
        }
      } else if ((e.ptr[j] >= 97) && (e.ptr[j] <= 122)) { // char j is lower case
        if (!(case_switch = !case_switch)) { // check if we should convert to upper case
          temp[j] = e.ptr[j] - 32;
          continue;
        }
      } else if (e.ptr[j] == 32) { // a space resets the alternation
        case_switch = false;
      }
      temp[j] = e.ptr[j];
    }

    // Create a new vector element sfstring and insert the processed string into the stringfish vector
    // sfstring has three constructors, 1) taking a std::string and encoding,
    // 2) a char pointer and encoding, or 3) a CHARSXP object (e.g. sfstring(NA_STRING))
    output_data[i] = sfstring(temp, e.enc);
    i++; // increment output index
  }
  // Finally, call unprotect and return result
  UNPROTECT(1);
  return output;
}

Example function call:

sf_alternate_case("hello world")
[1] "hElLo wOrLd"
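
The per-string logic in the script does not depend on R at all, so one option is to factor it into a plain C++ function that can be compiled and tested outside of Rcpp. The following is a minimal sketch under that assumption; alternate_case is a hypothetical helper name, not part of stringfish, and like the script above it only handles ASCII letters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of stringfish): alternate the case of ASCII
// letters in one string, restarting the pattern after each space.
std::string alternate_case(const std::string& in) {
    std::string out(in);       // start from a copy; non-letters pass through
    bool case_switch = false;  // toggled on every ASCII letter
    for (std::size_t j = 0; j < in.size(); ++j) {
        const char c = in[j];
        if (c >= 'A' && c <= 'Z') {
            case_switch = !case_switch;
            if (case_switch) out[j] = static_cast<char>(c + 32);   // to lower case
        } else if (c >= 'a' && c <= 'z') {
            case_switch = !case_switch;
            if (!case_switch) out[j] = static_cast<char>(c - 32);  // to upper case
        } else if (c == ' ') {
            case_switch = false;  // each word starts the pattern afresh
        }
    }
    return out;
}
```

With this helper, alternate_case("hello world") returns "hElLo wOrLd", matching the R call above. Inside sf_alternate_case it could be applied as alternate_case(std::string(e.ptr, e.len)) before the result is wrapped in sfstring.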

To do

