qsbase/qs2


qs2: a framework for efficient serialization

qs2 is the successor to the qs package. The goal is to have reliable and fast performance for saving and loading objects in R.

The qs2 format directly uses R serialization (via the R_Serialize/R_Unserialize C API) while improving underlying compression and disk IO patterns. If you are familiar with the qs package, the benefits and usage are the same.

qs_save(data, "myfile.qs2")
data <- qs_read("myfile.qs2")

Use the file extension qs2 to distinguish it from the original qs package. It is not compatible with the original qs format.
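As a self-contained illustration of the save/read cycle above, the following sketch writes a small data frame to a temporary file and checks that it reads back unchanged. The object `df` and the temporary path are made up for the example; only `qs_save` and `qs_read` come from the package.

```r
library(qs2)

# Hypothetical example data; any R object can be used here
df <- data.frame(id = 1:5, label = letters[1:5])

# Write to a temporary file with the recommended .qs2 extension
path <- tempfile(fileext = ".qs2")
qs_save(df, path)

# Read it back and confirm the round trip preserved the object
df2 <- qs_read(path)
stopifnot(identical(df, df2))
```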

Installation

install.packages("qs2")

On x64 Mac or Linux, you can enable multi-threading by compiling from source. It is enabled by default on Windows.

remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")

On non-x64 systems (e.g. Mac ARM) remove the AVX2 flag.

remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB")

Multi-threading in qs2 uses the Intel Threading Building Blocks (TBB) framework via the RcppParallel package.

Converting qs2 to RDS

Because the qs2 format directly uses R serialization, you can convert it to RDS and vice versa.

file_qs2 <- tempfile(fileext = ".qs2")
file_rds <- tempfile(fileext = ".RDS")
x <- runif(1e6)

# save `x` with qs_save
qs_save(x, file_qs2)

# convert the file to RDS
qs_to_rds(input_file = file_qs2, output_file = file_rds)

# read `x` back in with `readRDS`
xrds <- readRDS(file_rds)
stopifnot(identical(x, xrds))
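The reverse direction ("vice versa") can be sketched the same way. This assumes the package's `rds_to_qs` function takes the input file first and the output file second, mirroring `qs_to_rds`; check the package documentation for the exact signature.

```r
library(qs2)

rds_path <- tempfile(fileext = ".RDS")
qs2_path <- tempfile(fileext = ".qs2")
v <- runif(1e5)

# save `v` with base R
saveRDS(v, rds_path)

# convert the RDS file to qs2 (argument order assumed: input, then output)
rds_to_qs(rds_path, qs2_path)

# read `v` back in with `qs_read`
vqs <- qs_read(qs2_path)
stopifnot(identical(v, vqs))
```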

Validating file integrity

The qs2 format saves an internal checksum. This can be used to test for file corruption before deserialization via the validate_checksum parameter, but has a minor performance penalty.

qs_save(data, "myfile.qs2")
data <- qs_read("myfile.qs2", validate_checksum = TRUE)

The qdata format

The package also introduces the qdata format, which has its own serialization layout and works only with data types (vectors, lists, data frames, matrices).

It will replace internal types (functions, promises, external pointers, environments, objects) with NULL. The qdata format differs from the qs2 format in that it is NOT a general-purpose serialization format.

The eventual goal of qdata is to also have interoperability with other languages, particularly Python.

qd_save(data, "myfile.qs2")
data <- qd_read("myfile.qs2")
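To make the NULL replacement concrete, here is a minimal sketch with a made-up list that mixes plain data and a function. It assumes the documented behavior described above; `suppressWarnings` is used only to silence the unsupported-type warning that `qd_save` emits by default.

```r
library(qs2)

# Hypothetical object: two data fields plus a function (an unsupported type)
obj <- list(values = c(1.5, 2.5), labels = c("a", "b"), fn = function(x) x + 1)

qd_path <- tempfile()
suppressWarnings(qd_save(obj, qd_path))
res <- qd_read(qd_path)

# The data fields survive; the function does not
stopifnot(identical(res$values, obj$values))
stopifnot(identical(res$labels, obj$labels))
stopifnot(is.null(res$fn))
```

If the object must round-trip exactly, including functions or environments, use qs_save and qs_read instead.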

Benchmarks

A summary across 4 datasets is presented below.

Single-threaded

Algorithm         Compression   Save Time (s)   Read Time (s)
qs2               7.96          13.4            50.4
qdata             8.45          10.5            34.8
base::serialize   1.1           8.87            51.4
saveRDS           8.68          107             63.7
fst               2.59          5.09            46.3
parquet           8.29          20.3            38.4
qs (legacy)       7.97          9.13            48.1

Multi-threaded (8 threads)

Algorithm         Compression   Save Time (s)   Read Time (s)
qs2               7.96          3.79            48.1
qdata             8.45          1.98            33.1
fst               2.59          5.05            46.6
parquet           8.29          20.2            37.0
qs (legacy)       7.97          3.21            52.0

  • qs2, qdata and qs with compress_level = 3
  • parquet via the arrow package using zstd compression_level = 3
  • base::serialize with ascii = FALSE and xdr = FALSE

Datasets used

  • 1000 genomes non-coding VCF: 1000 genomes non-coding variants (2743 MB)
  • B-cell data: B-cell mouse data, Greiff 2017 (1057 MB)
  • IP location: IPV4 range data with location information (198 MB)
  • Netflix movie ratings: Netflix ML prediction dataset (571 MB)

These datasets are openly licensed and represent a combination of numeric and text data across multiple domains. See inst/analysis/datasets.R on GitHub.

Usage in C/C++

Serialization functions can be accessed in compiled code. Below is an example using Rcpp.

// [[Rcpp::depends(qs2)]]
#include <Rcpp.h>
#include "qs2_external.h"
using namespace Rcpp;

// [[Rcpp::export]]
SEXP test_qs_serialize(SEXP x) {
  size_t len = 0;
  unsigned char * buffer = c_qs_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
  SEXP y = c_qs_deserialize(buffer, len, false, 4); // buffer, buffer length, validate_checksum, nthreads
  c_qs_free(buffer); // must manually free buffer
  return y;
}

// [[Rcpp::export]]
SEXP test_qd_serialize(SEXP x) {
  size_t len = 0;
  unsigned char * buffer = c_qd_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
  SEXP y = c_qd_deserialize(buffer, len, false, false, 4); // buffer, buffer length, use_alt_rep, validate_checksum, nthreads
  c_qd_free(buffer); // must manually free buffer
  return y;
}

/*** R
x <- runif(1e7)
stopifnot(test_qs_serialize(x) == x)
stopifnot(test_qd_serialize(x) == x)
*/

Global Options for qs2

The following global options control the behavior of the qs2 functions. These global options can be queried or modified using the qopt function.

  • compress_level
    The default compression level used when compressing data.
    Default: 3L

  • shuffle
    A logical flag indicating whether to allow byte shuffling during compression.
    Default: TRUE

  • nthreads
    The number of threads used for compression and decompression.
    Default: 1L

  • validate_checksum
    A logical flag indicating whether to validate the stored checksum when reading data.
    Default: FALSE

  • warn_unsupported_types
    For qd_save, a logical flag indicating whether to warn when saving an object with unsupported types.
    Default: TRUE

  • use_alt_rep
    For qd_read, a logical flag indicating whether to use ALTREP when reading in string data.
    Default: FALSE
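A short sketch of querying and changing an option with qopt follows. It assumes the call shape `qopt(parameter, value = ...)`, with the value argument omitted when querying; the level 5L is an arbitrary choice for illustration.

```r
library(qs2)

# Query the current default compression level
old_level <- qopt("compress_level")

# Raise the default for subsequent qs_save / qd_save calls
qopt("compress_level", value = 5L)
new_level <- qopt("compress_level")
stopifnot(new_level == 5L)

# Restore the previous setting so later code is unaffected
qopt("compress_level", value = old_level)
restored_level <- qopt("compress_level")
stopifnot(restored_level == old_level)
```

Saving and restoring the previous value keeps the change local, which matters in scripts or packages that should not alter a user's session-wide settings.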
