qs2: a framework for efficient serialization
qs2 is the successor to the qs package. The goal is to have reliable and fast performance for saving and loading objects in R.
The qs2 format directly uses R serialization (via the R_Serialize/R_Unserialize C API) while improving underlying compression and disk IO patterns. If you are familiar with the qs package, the benefits and usage are the same.

    qs_save(data, "myfile.qs2")
    data <- qs_read("myfile.qs2")

Use the file extension qs2 to distinguish it from the original qs package. It is not compatible with the original qs format.
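
As a self-contained illustration of the save/read pair above, the sketch below round-trips a small data frame through a temporary file. The object `df` and the path variable are illustrative names; only `qs_save` and `qs_read` come from the package.

```r
library(qs2)

# A small example object; any R object could be used here.
df <- data.frame(id = 1:5, label = letters[1:5], stringsAsFactors = FALSE)

# Write to a temporary file with the recommended .qs2 extension.
path <- tempfile(fileext = ".qs2")
qs_save(df, path)

# Read the object back and check that nothing changed.
df2 <- qs_read(path)
stopifnot(identical(df, df2))
```

Using `tempfile()` keeps the example from leaving files in the working directory.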

    install.packages("qs2")

On x64 Mac or Linux, you can enable multi-threading by compiling from source. It is enabled by default on Windows.

    remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")

On non-x64 systems (e.g. Mac ARM) remove the AVX2 flag.

    remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB")

Multi-threading in qs2 uses the Intel Threading Building Blocks framework via the RcppParallel package.
Because the qs2 format directly uses R serialization, you can convert it to RDS and vice versa.

    file_qs2 <- tempfile(fileext = ".qs2")
    file_rds <- tempfile(fileext = ".RDS")
    x <- runif(1e6)

    # save `x` with qs_save
    qs_save(x, file_qs2)

    # convert the file to RDS
    qs_to_rds(input_file = file_qs2, output_file = file_rds)

    # read `x` back in with `readRDS`
    xrds <- readRDS(file_rds)
    stopifnot(identical(x, xrds))

The qs2 format saves an internal checksum. This can be used to test for file corruption before deserialization via the validate_checksum parameter, but has a minor performance penalty.

    qs_save(data, "myfile.qs2")
    data <- qs_read("myfile.qs2", validate_checksum = TRUE)

The package also introduces the qdata format, which has its own serialization layout and works only with data types (vectors, lists, data frames, matrices).
It will replace internal types (functions, promises, external pointers, environments, objects) with NULL. The qdata format differs from the qs2 format in that it is NOT a general-purpose serialization format.
The eventual goal of qdata is to also have interoperability with other languages, particularly Python.
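
To make the replacement of unsupported types concrete, here is a minimal sketch that saves a list containing a function with the qdata format and reads it back. The list `x` and its element names are illustrative. `suppressWarnings` is used on the assumption that saving an unsupported type emits a warning under the default settings.

```r
library(qs2)

# A list mixing a supported type (integer vector) with an unsupported one (function).
x <- list(a = 1:3, f = function(z) z + 1)

path <- tempfile(fileext = ".qs2")

# qdata stores data types only; the function is expected to be replaced with NULL.
suppressWarnings(qd_save(x, path))

y <- qd_read(path)
stopifnot(identical(y$a, 1:3)) # data survives the round trip
stopifnot(is.null(y$f))        # the function does not
```

If the object must come back unchanged, including functions or environments, the qs2 format (`qs_save`/`qs_read`) is the one to use.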

    qd_save(data, "myfile.qs2")
    data <- qd_read("myfile.qs2")

A summary across 4 datasets is presented below.

Single-threaded

| Algorithm | Compression | Save Time (s) | Read Time (s) |
|---|---|---|---|
| qs2 | 7.96 | 13.4 | 50.4 |
| qdata | 8.45 | 10.5 | 34.8 |
| base::serialize | 1.1 | 8.87 | 51.4 |
| saveRDS | 8.68 | 107 | 63.7 |
| fst | 2.59 | 5.09 | 46.3 |
| parquet | 8.29 | 20.3 | 38.4 |
| qs (legacy) | 7.97 | 9.13 | 48.1 |

Multi-threaded

| Algorithm | Compression | Save Time (s) | Read Time (s) |
|---|---|---|---|
| qs2 | 7.96 | 3.79 | 48.1 |
| qdata | 8.45 | 1.98 | 33.1 |
| fst | 2.59 | 5.05 | 46.6 |
| parquet | 8.29 | 20.2 | 37.0 |
| qs (legacy) | 7.97 | 3.21 | 52.0 |

- qs2, qdata and qs with compress_level = 3
- parquet via the arrow package using zstd compression_level = 3
- base::serialize with ascii = FALSE and xdr = FALSE

Datasets used

- 1000 genomes non-coding VCF: 1000 genomes non-coding variants (2743 MB)
- B-cell data: B-cell mouse data, Greiff 2017 (1057 MB)
- IP location: IPv4 range data with location information (198 MB)
- Netflix movie ratings: Netflix ML prediction dataset (571 MB)

These datasets are openly licensed and represent a combination of numeric and text data across multiple domains. See inst/analysis/datasets.R on GitHub.
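
One way to read the two benchmark tables together is to compare save times for the algorithms that appear in both. The sketch below does that arithmetic in base R, assuming the first table reports single-threaded results and the second reports multi-threaded results for the same benchmark. The numbers are copied from the tables; the variable names are illustrative.

```r
# Save times (s) copied from the first and second benchmark tables.
save_first  <- c(qs2 = 13.4, qdata = 10.5, fst = 5.09, parquet = 20.3, qs_legacy = 9.13)
save_second <- c(qs2 = 3.79, qdata = 1.98, fst = 5.05, parquet = 20.2, qs_legacy = 3.21)

# Ratio of first-table to second-table save time; values above 1 mean
# the second table's run saved faster.
speedup <- round(save_first / save_second, 2)
print(speedup)
```

On these figures the qs2, qdata and qs save times drop by a factor of roughly three to five, while fst and parquet are essentially unchanged.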
Serialization functions can be accessed in compiled code. Below is an example using Rcpp.

    // [[Rcpp::depends(qs2)]]
    #include <Rcpp.h>
    #include "qs2_external.h"
    using namespace Rcpp;

    // [[Rcpp::export]]
    SEXP test_qs_serialize(SEXP x) {
      size_t len = 0;
      unsigned char * buffer = c_qs_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
      SEXP y = c_qs_deserialize(buffer, len, false, 4); // buffer, buffer length, validate_checksum, nthreads
      c_qs_free(buffer); // must manually free buffer
      return y;
    }

    // [[Rcpp::export]]
    SEXP test_qd_serialize(SEXP x) {
      size_t len = 0;
      unsigned char * buffer = c_qd_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
      SEXP y = c_qd_deserialize(buffer, len, false, false, 4); // buffer, buffer length, use_alt_rep, validate_checksum, nthreads
      c_qd_free(buffer); // must manually free buffer
      return y;
    }

    /*** R
    x <- runif(1e7)
    stopifnot(test_qs_serialize(x) == x)
    stopifnot(test_qd_serialize(x) == x)
    */

The following global options control the behavior of the qs2 functions. These global options can be queried or modified using the qopt function.

compress_level
The default compression level used when compressing data.
Default: 3L

shuffle
A logical flag indicating whether to allow byte shuffling during compression.
Default: TRUE

nthreads
The number of threads used for compression and decompression.
Default: 1L

validate_checksum
A logical flag indicating whether to validate the stored checksum when reading data.
Default: FALSE

warn_unsupported_types
For qd_save, a logical flag indicating whether to warn when saving an object with unsupported types.
Default: TRUE

use_alt_rep
For qd_read, a logical flag indicating whether to use ALTREP when reading in string data.
Default: FALSE
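
A minimal sketch of querying and changing one of these options with qopt, assuming the signature `qopt(parameter, value = NULL)`. The level `5L` is an arbitrary choice for illustration.

```r
library(qs2)

# Query the current default compression level.
old_level <- qopt("compress_level")

# Change the default, then confirm that the new value is in effect.
qopt("compress_level", value = 5L)
stopifnot(qopt("compress_level") == 5L)

# Restore the previous setting so later calls are unaffected.
qopt("compress_level", value = old_level)
```

Restoring the old value at the end is a design choice for scripts that should not leave global state changed.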