stringfish is a framework for performing string and sequence operations using the ALTREP system to speed up the computation of common string operations.
The ultimate goal of the package is to unify ALTREP string implementations under a common framework.
The ALTREP system (new as of R 3.5.0) allows package developers to represent R objects using their own custom memory layout, completely invisible to the user. stringfish represents string data as a simple C++/STL vector, which is very fast and lightweight.
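To make the "simple C++/STL vector" idea concrete, here is a minimal standalone sketch of what such a layout could look like. The names `Encoding`, `StrElem`, `StrVec`, and `total_bytes` are hypothetical and chosen only for illustration; they are not stringfish's actual types, which are defined in the package's own headers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names for illustration only; not stringfish's real types.
enum class Encoding { Native, UTF8, Latin1, Bytes, NA };

// One element: the raw bytes plus an encoding tag.
// In this sketch, Encoding::NA marks a missing value.
struct StrElem {
  std::string data;
  Encoding enc;
};

// The whole string vector is just a std::vector of elements.
using StrVec = std::vector<StrElem>;

// Example operation that works directly on the custom layout:
// sum the byte lengths of all non-missing elements.
std::size_t total_bytes(const StrVec& v) {
  std::size_t n = 0;
  for (const StrElem& e : v) {
    if (e.enc != Encoding::NA) {
      n += e.data.size();
    }
  }
  return n;
}
```

A function like `total_bytes` never has to build ordinary R strings; it reads the container directly, which is the kind of access an ALTREP-aware function can exploit.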
Using normal R functions to process string data (e.g. substr, gsub, paste, etc.) causes "materialization" of ALTREP vectors to normal R data, which can be a slow process. Therefore, in order to take full advantage of the ALTREP framework, string processing functions need to be re-written to be ALTREP aware. This package hopes to fulfill that purpose.
install.packages("stringfish", type="source", configure.args="--with-simd=AVX2")

The simplest way to show the utility of the ALTREP framework is through a quick benchmark comparing stringfish and base R.

Yes, you are reading the graph correctly: some functions in stringfish are more than an order of magnitude faster than vectorized base R operations (and even faster with some built-in multithreading). On large text datasets, this can turn minutes of computation into seconds.
A list of implemented stringfish functions and analogous base R functions:
sf_iconv (iconv)
sf_nchar (nchar)
sf_substr (substr)
sf_paste (paste0)
sf_collapse (paste0)
sf_readLines (readLines)
sf_writeLines (writeLines)
sf_grepl (grepl)
sf_gsub (gsub)
sf_toupper (toupper)
sf_tolower (tolower)
sf_starts (startsWith)
sf_ends (endsWith)
sf_trim (trimws)
sf_split (strsplit)
sf_match (match for strings only)
sf_compare/sf_equals (==, ALTREP-aware string equality)

Utility functions:
sf_vector – creates a new and empty stringfish vector
sf_assign – assigns strings into a stringfish vector in place (like x[i] <- "mystring")
sf_convert/convert_to_sf – converts a character vector to a stringfish vector
get_string_type – determines string type (whether ALTREP or normal)
materialize – converts any ALTREP object into a normal R object
random_strings – creates random strings as either a stringfish or normal R vector
string_identical – like identical for strings but also requires identical encoding (i.e. latin1 and UTF-8 strings will not match)

In addition, many R operations in base R and other packages are already ALTREP-aware (i.e. they don't cause materialization). Functions that subset or index into string vectors generally do not materialize.
sample
head
tail
[ – e.g. x[20:30]
dplyr::filter – e.g. dplyr::filter(df, sf_starts("a"))

stringfish functions are not intended to exactly replicate their base R analogues. One difference is that subject parameters are always the first argument, which is easier to use with pipes (%>%). E.g., gsub(pattern, replacement, subject) becomes sf_gsub(subject, pattern, replacement).
stringfish as a framework is intended to be easily extensible. stringfish vectors can be worked into Rcpp scripts or even into other packages (see the qs2 package for an example).
Below is a detailed Rcpp script that creates a function to alternate upper and lower case of strings.
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(stringfish)]]
#include <Rcpp.h>
#include "sf_external.h"
using namespace Rcpp;

// [[Rcpp::export]]
SEXP sf_alternate_case(SEXP x) {
  // Iterate through a character vector using the RStringIndexer class
  // If the input vector x is a stringfish character vector it will do so without materialization
  RStringIndexer r(x);
  size_t len = r.size();

  // Create an output stringfish vector
  // Like all R objects, it must be protected from garbage collection
  SEXP output = PROTECT(sf_vector(len));

  // Obtain a reference to the underlying output data
  sf_vec_data & output_data = sf_vec_data_ref(output);

  // You can use a range-based for loop via an iterator class that returns RStringIndexer::rstring_info e
  // rstring_info is a struct containing const char * ptr (null terminated), int len, and cetype_t enc
  // an NA string is represented by a nullptr
  // Alternatively, access the data via the function r.getCharLenCE(i)
  size_t i = 0;
  for(auto e : r) {
    // check if string is NA and go to next if it is
    if(e.ptr == nullptr) {
      i++; // increment output index
      continue;
    }
    // create a temporary output string and process the results
    std::string temp(e.len, '\0');
    bool case_switch = false;
    for(int j = 0; j < e.len; j++) {
      if((e.ptr[j] >= 65) && (e.ptr[j] <= 90)) { // char j is upper case
        if((case_switch = !case_switch)) { // check if we should convert to lower case
          temp[j] = e.ptr[j] + 32;
          continue;
        }
      } else if((e.ptr[j] >= 97) && (e.ptr[j] <= 122)) { // char j is lower case
        if(!(case_switch = !case_switch)) { // check if we should convert to upper case
          temp[j] = e.ptr[j] - 32;
          continue;
        }
      } else if(e.ptr[j] == 32) { // a space resets the alternation
        case_switch = false;
      }
      temp[j] = e.ptr[j];
    }

    // Create a new vector element sfstring and insert the processed string into the stringfish vector
    // sfstring has three constructors, 1) taking a std::string and encoding,
    // 2) a char pointer and encoding, or 3) a CHARSXP object (e.g. sfstring(NA_STRING))
    output_data[i] = sfstring(temp, e.enc);
    i++; // increment output index
  }
  // Finally, call unprotect and return result
  UNPROTECT(1);
  return output;
}

Example function call:
sf_alternate_case("hello world")
[1] "hElLo wOrLd"
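The per-string logic of the script can also be checked outside of R. The following is a minimal plain C++ sketch of the same alternation rule, operating on a `std::string` instead of a stringfish vector. The function name `alternate_case` is hypothetical and not part of the package, and like the script above it assumes ASCII letters only.

```cpp
#include <cstddef>
#include <string>

// Standalone version of the inner loop of sf_alternate_case.
// The name alternate_case is hypothetical, used only for this sketch.
std::string alternate_case(const std::string& in) {
  std::string out(in.size(), '\0');
  bool case_switch = false;  // toggled on every letter, reset on a space
  for (std::size_t j = 0; j < in.size(); j++) {
    const char c = in[j];
    out[j] = c;  // default: copy the character unchanged
    if (c >= 'A' && c <= 'Z') {
      case_switch = !case_switch;
      if (case_switch) {
        out[j] = static_cast<char>(c + 32);  // upper -> lower
      }
    } else if (c >= 'a' && c <= 'z') {
      case_switch = !case_switch;
      if (!case_switch) {
        out[j] = static_cast<char>(c - 32);  // lower -> upper
      }
    } else if (c == ' ') {
      case_switch = false;  // start the pattern again at each word
    }
  }
  return out;
}
```

Calling `alternate_case("hello world")` returns `"hElLo wOrLd"`, matching the R output above. Non-letter characters other than a space are copied as-is and do not advance the alternation.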