jmatrixsc

Juan Domingo

Please, read this

This is a copy of the vignette of package jmatrix. It is included here since jmatrix underlies this package and you will need to know how to prepare your data to be processed by scellpam. But you must NOT load package jmatrix; all the functions detailed below are already included in scellpam and are available just by loading it with library(scellpam).

library(scellpam)

Purpose

The package jmatrix (Domingo (2023b)) was originally conceived as a tool for other packages, namely parallelpam (Domingo (2023c)) and scellpam (Domingo (2023a)), which needed to deal with very big matrices that might not fit in the memory of the computer, particularly if their elements are of type double (in most modern machines, 8 bytes per element), whereas they could fit if they were matrices of other data types, in particular floats (4 bytes per element).
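To make the size argument concrete, here is a minimal sketch in base R of the arithmetic involved. The helper name matrix_bytes and the matrix dimensions are illustrative assumptions, not part of the package.

```r
# Bytes needed to hold a dense nrow x ncol matrix with a given element size.
# matrix_bytes is a hypothetical helper written for this illustration.
matrix_bytes <- function(nrow, ncol, bytes_per_element) {
  as.numeric(nrow) * as.numeric(ncol) * bytes_per_element
}

# Illustrative dimensions only (not taken from the vignette)
n_rows <- 1e5
n_cols <- 2e4

gib <- 1024^3
as_double <- matrix_bytes(n_rows, n_cols, 8) / gib  # 8 bytes per element
as_float  <- matrix_bytes(n_rows, n_cols, 4) / gib  # 4 bytes per element

cat(sprintf("double: %.1f GiB, float: %.1f GiB\n", as_double, as_float))
```

Whatever the dimensions, the float figure is exactly half the double figure, which is the saving the paragraph above refers to.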

Unfortunately, R is not a strongly typed language. Double is the default type in R and it is not easy to work with other data types. Attempts such as the package float (Schmidt (2022)) have been made, but to use them you have to coerce a matrix already loaded in R memory to a float matrix, and only then can you delete the original. But what happens if your computer does not have enough memory to hold the matrix in the first place? This is the problem this package tries to address.

Our idea is to use the disk as temporary storage for the matrix, in a file with an internal binary format (jmatrix format). This format has a header of 128 bytes with information such as the type of matrix (full, sparse or symmetric), the data type of each element (char, short, int, long, float, double or long double), the number of rows and columns, and the endianness; then comes the content as binary data (in sparse matrices zeros are not stored; in symmetric matrices only the lower-diagonal part is stored); and finally the metadata (currently, names for rows/columns if needed and an optional comment).

Such files are created and loaded by functions written in C++ which are accessible from R with Rcpp (Eddelbuettel and François (2011)). The file, once loaded, uses strictly the memory needed for its data type and can be processed by other C++ functions (like the PAM algorithm or any other numeric library written in C++), also from inside R.

The matrix contained in a binary data file in jmatrix format cannot be loaded directly in R memory as an R matrix (that would be impossible anyway, since this package is intended precisely for the cases in which such a matrix would NOT fit into the available RAM). Nevertheless, limited access is provided through some functions that read one or more rows or one or more columns as R vectors or matrices (obviously, coerced to double).

The package jmatrix must not be considered final, finished software. Currently it is mostly an instrumental solution to address our needs, and we make it available as a separate package just in case it could be useful for anyone else.

Workflow

Debug messages

First of all, the package can show quite informative (but sometimes verbose) messages in the console. To turn such messages on or off you can use:

ScellpamSetDebug(TRUE,TRUE,TRUE)
#> Debugging for scellpam (biological part) of the package set to ON.
#> Debugging for parallelpam inside scellpam package set to ON.
#> Debugging for jmatrix inside scellpam package set to ON.
# Initially, state of debug is FALSE.

Data storage

As stated before, the binary matrix files should normally be created from C++, getting the data from an external source like a data file in a format used in bioinformatics or a .csv file. These files should be read by chunks. As an example, look at function CsvToJMat in package scellpam.
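The idea of reading by chunks can be sketched in plain R as follows. This is only an illustration of the principle, not the C++ code used by CsvToJMat; the function name column_sums_by_chunks, the chunk size and the small test file are assumptions made for the example.

```r
# Read a .csv a few lines at a time, keeping only a running summary in memory.
# column_sums_by_chunks is a hypothetical helper, not a package function.
column_sums_by_chunks <- function(path, chunk_size = 2L, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)   # first line holds the column names
  # The first header field belongs to the (unnamed) column of row names
  n_cols <- length(strsplit(header, sep, fixed = TRUE)[[1]]) - 1L
  sums <- numeric(n_cols)
  n_rows <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break   # end of file reached
    # Parse only this chunk; the first field of each line is the row name
    chunk <- read.csv(text = lines, header = FALSE, sep = sep, row.names = 1)
    sums <- sums + colSums(as.matrix(chunk))
    n_rows <- n_rows + nrow(chunk)
  }
  list(n_rows = n_rows, sums = sums)
}

# Small illustrative file written with the standard R function
tmp <- tempfile(fileext = ".csv")
m <- matrix(1:12, nrow = 4, dimnames = list(LETTERS[1:4], c("a", "b", "c")))
write.csv(m, tmp)
res <- column_sums_by_chunks(tmp)
```

At no point does the function hold more than chunk_size data lines, which is what makes the approach usable for files larger than the available memory.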

As a convenience and only for testing purposes (to be used in this vignette), we provide the function JWriteBin to write an R matrix as a jmatrix file.

# Create a 6x8 matrix of random values
Rf<-matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf)<-c("A","B","C","D","E","F")
colnames(Rf)<-c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rf
#>            a         b         c           d         e         f          g
#> A 0.27589971 0.9846780 0.3718661 0.054738886 0.2069468 0.2158733 0.09942662
#> B 0.01678051 0.2110365 0.7561296 0.143986892 0.5582157 0.6200316 0.32619514
#> C 0.48299428 0.5528306 0.1274279 0.421164190 0.5377578 0.8797580 0.90853176
#> D 0.77741102 0.5887183 0.3318360 0.634689683 0.7284263 0.1091728 0.03242129
#> E 0.29742146 0.3560088 0.1045180 0.596748357 0.4722127 0.7780832 0.43684075
#> F 0.67264312 0.2984609 0.1981823 0.008672506 0.6744010 0.9683110 0.37124352
#>           h
#> A 0.4432057
#> B 0.1925767
#> C 0.3744585
#> D 0.4201738
#> E 0.7171956
#> F 0.1224369

# and write it as the binary file Rfullfloat.bin
JWriteBin(Rf,"Rfullfloat.bin",dtype="float",dmtype="full",comment="Full matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of floats

# Also, you can write it with double data type:
JWriteBin(Rf,"Rfulldouble.bin",dtype="double",dmtype="full",comment="Full matrix of doubles")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfulldouble.bin of (6x8)
#> End of block of binary data at offset 512
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of doubles
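The offsets reported in the log messages above are consistent with a 128-byte header followed by one value per matrix element. The following sketch reproduces that arithmetic in base R; it is a reading of the logged numbers, and the helper name expected_offset is an assumption for illustration.

```r
# Header size stated in the description of the jmatrix format
header_bytes <- 128

# Offset at which the binary data of a full matrix would end, assuming the
# data block starts right after the header (an inference from the log output).
expected_offset <- function(n_rows, n_cols, bytes_per_element) {
  header_bytes + n_rows * n_cols * bytes_per_element
}

offset_float  <- expected_offset(6, 8, 4)  # 128 + 48 * 4 = 320
offset_double <- expected_offset(6, 8, 8)  # 128 + 48 * 8 = 512
```

Both values match the "End of block of binary data" messages printed by JWriteBin for the float and double files.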

To get information about the stored file the function JMatInfo is provided. Of course, this function does not read the complete file into memory but just the header.

# Information about the float binary file
JMatInfo("Rfullfloat.bin")
#> File:               Rfullfloat.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Full matrix of floats"

# Same information about the double binary file
JMatInfo("Rfulldouble.bin")
#> File:               Rfulldouble.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          double
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Full matrix of doubles"

A jmatrix binary file can be exported to a .csv/.tsv table. This is done with the function JMatToCsv.

# Create a 6x8 matrix of random values
Rf<-matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf)<-c("A","B","C","D","E","F")
colnames(Rf)<-c("a","b","c","d","e","f","g","h")
# Store it as the binary file Rfullfloat.bin
JWriteBin(Rf,"Rfullfloat.bin",dtype="float",dmtype="full",comment="Full matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Full matrix of floats

# Save the content of this .bin as a .csv file
JMatToCsv("Rfullfloat.bin","Rfullfloat.csv",csep=",",withquotes=FALSE)
#> Read full matrix with size (6,8)

The generated file will have quotes neither around the column names (in its first line) nor around each row name (at the beginning of each line), since withquotes is FALSE, but it can be set to TRUE for the opposite behavior. Also, a .tsv (tab-separated values) file would have been generated using csep="\t".

A jmatrix binary file can also be generated from a .csv/.tsv file. Such a file must have a first line with the names of the columns (possibly surrounded by double quotes, including a first empty double-quoted field, since the column of row names has no name itself). The rest of its lines must start with a string (possibly surrounded by double quotes) with the row name, followed by the values. In all cases (first line and data lines) each column must be separated from the next by a separation character (usually, a comma). No separation character must be added at the end of each line. This format is compatible with the .csv generated by R with the function write.csv.
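As an illustration of the layout just described, the following base R snippet writes a tiny matrix with write.csv and prints the resulting lines. The matrix values and the temporary file are arbitrary choices made for this example.

```r
# A 2x2 matrix with row and column names (values chosen arbitrarily)
m <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2,
            dimnames = list(c("A", "B"), c("a", "b")))

tmp <- tempfile(fileext = ".csv")
write.csv(m, tmp)          # default settings: quoted names, comma separator

csv_lines <- readLines(tmp)
writeLines(csv_lines)
# "","a","b"
# "A",1.5,3.5
# "B",2.5,4.5
```

The first line starts with an empty quoted field for the unnamed column of row names, each data line starts with the row name, and no line ends with a separator.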

The function to read .csv files is CsvToJMat.

# Create a 6x8 matrix of random values
Rf<-matrix(runif(48),nrow=6)
# Set row and column names for it
rownames(Rf)<-c("A","B","C","D","E","F")
colnames(Rf)<-c("a","b","c","d","e","f","g","h")
# Save it as a .csv file with the standard R function...
write.csv(Rf,"rf.csv")
# ...and read it to create a jmatrix binary file
CsvToJMat("rf.csv","rf.bin",mtype="full",csep=",",ctype="raw",valuetype="float",transpose=FALSE,comment="Test matrix generated reading a .csv file")
#> 8 columns of values (not including the column of names) in file rf.csv.
#> 6 lines (excluding header) in file rf.csv
#> Data will be read from each line and stored as float values.
#> Reading line... 0
#> Read 6 data lines of file rf.csv, as expected.
#> Writing binary matrix rf.bin of (6x8)
#> End of block of binary data at offset 320
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Test matrix generated reading a .csv file

# Let's see the characteristics of the binary file
JMatInfo("rf.bin")
#> File:               rf.bin
#> Matrix type:        FullMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Test matrix generated reading a .csv file"

Special note for symmetric matrices:

The parameter mtype="symmetric" will consider the content of the .csv file as a symmetric matrix. This implies that it must be a square matrix (same number of rows and columns). The upper-diagonal part must be present in the file (it does not matter with which values), but it will be read and immediately ignored; i.e., only the lower-diagonal part (including the main diagonal) will be stored.

Data load

As stated before, no function is provided to read the whole matrix into memory, since that would contradict the philosophy of this package, but you can get rows or columns from a file.

# Reads row 1 into vector vf. Float values inside the file are
# promoted to double.
(vf<-GetJRow("Rfullfloat.bin",1))
#>          a          b          c          d          e          f          g
#> 0.13720177 0.32074910 0.56739020 0.02200842 0.07246488 0.77734601 0.97423404
#>          h
#> 0.85302329

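Obviously, storage as float causes a loss of precision. We have observed this not to be relevant for the PAM (partitioning around medoids) algorithm, but it can be important in other cases. It is the price to pay for halving the needed space.

# Checks the precision lost
max(abs(Rf[1,]-vf))
#> [1] 0.3643971

Nevertheless, storing as double obviously keeps the data intact.

vd<-GetJRow("Rfulldouble.bin",1)
max(abs(Rf[1,]-vd))
#> [1] 0.8391243

Note that the two values printed above are much larger than rounding error because, in this run, Rf was regenerated with runif after Rfullfloat.bin and Rfulldouble.bin were written, so each check compares two different random matrices. Compared against the matrix actually written, the difference would be expected to be at the level of single-precision rounding for the float file and zero for the double file.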
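The size of the rounding error caused by storing values as 4-byte floats can be checked with base R alone, without the package. The sketch below round-trips a vector through writeBin and readBin; it illustrates the effect of the data type and is not a description of how the package reads or writes its files.

```r
set.seed(1)              # arbitrary seed, only for reproducibility
x <- runif(8)            # doubles in [0, 1)

# Round trip through 4-byte floats
raw_float <- writeBin(x, raw(), size = 4)
x_float   <- readBin(raw_float, what = "numeric", n = length(x), size = 4)

# Round trip through 8-byte doubles
raw_double <- writeBin(x, raw(), size = 8)
x_double   <- readBin(raw_double, what = "numeric", n = length(x), size = 8)

err_float  <- max(abs(x - x_float))   # small, bounded by float rounding
err_double <- max(abs(x - x_double))  # exactly zero

c(bytes_float = length(raw_float), bytes_double = length(raw_double))
```

For values below 1 the float error stays under 2^-24 (about 6e-8), while the float representation takes half the bytes of the double one.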

Now, let us see examples of some functions to read rows or columns by number or by name, or to read several rows/columns as an R matrix. In all examples, numbers for rows and columns follow the R convention (i.e. starting at 1).

# Read column number 3
(vf<-GetJCol("Rfullfloat.bin",3))
#>         A         B         C         D         E         F
#> 0.5673902 0.8786766 0.6234628 0.7779209 0.6899800 0.2735479

# Test precision (note: Rf was regenerated after Rfullfloat.bin was written,
# so this compares two different random matrices)
max(abs(Rf[,3]-vf))
#> [1] 0.8357554

# Read row with name C
(vf<-GetJRowByName("Rfullfloat.bin","C"))
#>         a         b         c         d         e         f         g         h
#> 0.8305408 0.3501627 0.6234628 0.1549637 0.1349286 0.5732384 0.1725612 0.8720549

# Read column with name c
(vf<-GetJColByName("Rfullfloat.bin","c"))
#>         A         B         C         D         E         F
#> 0.5673902 0.8786766 0.6234628 0.7779209 0.6899800 0.2735479

# Get the names of all rows or columns as vectors of R strings
(rn<-GetJRowNames("Rfullfloat.bin"))
#> [1] "A" "B" "C" "D" "E" "F"
(cn<-GetJColNames("Rfullfloat.bin"))
#> [1] "a" "b" "c" "d" "e" "f" "g" "h"

# Get the names of rows and columns simultaneously as a list of two elements
(l<-GetJNames("Rfullfloat.bin"))
#> $rownames
#> [1] "A" "B" "C" "D" "E" "F"
#>
#> $colnames
#> [1] "a" "b" "c" "d" "e" "f" "g" "h"

# Get several rows at once. The returned matrix has the rows in the
# same order as the passed list,
# and this list can contain even repeated values
(vm<-GetJManyRows("Rfullfloat.bin",c(1,4)))
#>           a         b         c          d          e         f         g
#> A 0.1372018 0.3207491 0.5673902 0.02200842 0.07246488 0.7773460 0.9742340
#> D 0.6204843 0.6626323 0.7779209 0.52911037 0.99443126 0.2887037 0.3533239
#>           h
#> A 0.8530233
#> D 0.2409358

# Of course, columns can be extracted equally
(vc<-GetJManyCols("Rfulldouble.bin",c(1,4)))
#>            a           d
#> A 0.27589971 0.054738886
#> B 0.01678051 0.143986892
#> C 0.48299428 0.421164190
#> D 0.77741102 0.634689683
#> E 0.29742146 0.596748357
#> F 0.67264312 0.008672506

# and similar functions are provided for extracting by names:
(vm<-GetJManyRowsByNames("Rfulldouble.bin",c("A","D")))
#>           a         b         c          d         e         f          g
#> A 0.2758997 0.9846780 0.3718661 0.05473889 0.2069468 0.2158733 0.09942662
#> D 0.7774110 0.5887183 0.3318360 0.63468968 0.7284263 0.1091728 0.03242129
#>           h
#> A 0.4432057
#> D 0.4201738
(vc<-GetJManyColsByNames("Rfulldouble.bin",c("a","d")))
#>            a           d
#> A 0.27589971 0.054738886
#> B 0.01678051 0.143986892
#> C 0.48299428 0.421164190
#> D 0.77741102 0.634689683
#> E 0.29742146 0.596748357
#> F 0.67264312 0.008672506

The package can manage and store sparse and symmetric matrices, too.

# Generation of a 6x8 sparse matrix
Rsp<-matrix(rep(0,48),nrow=6)
sparsity<-0.1
nnz<-round(48*sparsity)
where<-floor(47*runif(nnz))
val<-runif(nnz)
for (i in 1:nnz)
{
 Rsp[floor(where[i]/8)+1,(where[i]%%8)+1]<- val[i]
}
rownames(Rsp)<-c("A","B","C","D","E","F")
colnames(Rsp)<-c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rsp
#>           a b c         d e         f         g h
#> A 0.0000000 0 0 0.0000000 0 0.0000000 0.0000000 0
#> B 0.0000000 0 0 0.0000000 0 0.0000000 0.0000000 0
#> C 0.2819449 0 0 0.0000000 0 0.0000000 0.0000000 0
#> D 0.9078598 0 0 0.0000000 0 0.1857652 0.9463108 0
#> E 0.0000000 0 0 0.2251895 0 0.0000000 0.0000000 0
#> F 0.0000000 0 0 0.0000000 0 0.0000000 0.0000000 0

# Write the matrix as sparse with type float
JWriteBin(Rsp,"Rspafloat.bin",dtype="float",dmtype="sparse",comment="Sparse matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rspafloat.bin of (6x8)
#> End of block of binary data at offset 192
#>    Writing row names (6 strings written, from A to F).
#>    Writing column names (8 strings written, from a to h).
#>    Writing comment: Sparse matrix of floats

Notice that whether a matrix is stored as sparse, and the storage space it uses, can be checked with the matrix info.

JMatInfo("Rspafloat.bin")
#> File:               Rspafloat.bin
#> Matrix type:        SparseMatrix
#> Number of elements: 48
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     6
#> Number of columns:  8
#> Metadata:           Stored names of rows and columns.
#> Metadata comment:  "Sparse matrix of floats"
#> Binary data size:   64 bytes, which is 33.3333% of the full matrix size (which would be 192 bytes).

Be careful: trying to store as sparse a matrix which is not sparse (one that does not have a majority of 0-entries) works, but produces a file larger than the corresponding full matrix.
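Why a non-sparse matrix grows when stored as sparse can be illustrated with a rough size model. The layout assumed here (a 4-byte count per row, plus a 4-byte column index and one value per nonzero entry) is a guess, not documented in this vignette; it was chosen only because it reproduces the 64 bytes reported by JMatInfo for the 6x8 float example with 5 nonzero entries. The helper names are hypothetical.

```r
# ASSUMED layout (not documented here): per row a 4-byte count, and per
# nonzero entry a 4-byte column index plus the value itself.
sparse_bytes <- function(n_rows, nnz, bytes_per_element,
                         index_bytes = 4, count_bytes = 4) {
  n_rows * count_bytes + nnz * (index_bytes + bytes_per_element)
}

# Size of the data block of the equivalent full matrix
full_bytes <- function(n_rows, n_cols, bytes_per_element) {
  n_rows * n_cols * bytes_per_element
}

size_sparse_example <- sparse_bytes(6, 5, 4)    # 24 + 5 * 8  = 64
size_full_example   <- full_bytes(6, 8, 4)      # 48 * 4      = 192
size_dense_as_sparse <- sparse_bytes(6, 48, 4)  # 24 + 48 * 8 = 408
```

Under this assumed model, every stored float costs twice as much as in the full layout, so the sparse file only pays off when fewer than about half of the entries are nonzero.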

With respect to symmetric matrices, JWriteBin works the same way. Let us generate a \(7 \times 7\) symmetric matrix.

Rns<-matrix(runif(49),nrow=7)
Rsym<-0.5*(Rns+t(Rns))
rownames(Rsym)<-c("A","B","C","D","E","F","G")
colnames(Rsym)<-c("a","b","c","d","e","f","g")
# Let's see the matrix
Rsym
#>           a         b         c         d         e         f         g
#> A 0.6580855 0.2277210 0.3177799 0.1935018 0.8192427 0.2259890 0.5978475
#> B 0.2277210 0.9026866 0.8097610 0.8490149 0.2086403 0.4967763 0.2406990
#> C 0.3177799 0.8097610 0.4672116 0.6815653 0.4647701 0.8140876 0.4645924
#> D 0.1935018 0.8490149 0.6815653 0.2008103 0.2752492 0.2906596 0.2553448
#> E 0.8192427 0.2086403 0.4647701 0.2752492 0.6722613 0.5298083 0.2106208
#> F 0.2259890 0.4967763 0.8140876 0.2906596 0.5298083 0.3636066 0.6298440
#> G 0.5978475 0.2406990 0.4645924 0.2553448 0.2106208 0.6298440 0.8802311

# Write the matrix as symmetric with type float
JWriteBin(Rsym,"Rsymfloat.bin",dtype="float",dmtype="symmetric",comment="Symmetric matrix of floats")
#> The passed matrix has row names for the 7 rows and they will be used.
#> Writing binary matrix Rsymfloat.bin
#> End of block of binary data at offset 240
#>    Writing row names (7 strings written, from A to G).
#>    Writing comment: Symmetric matrix of floats

# Get the information
JMatInfo("Rsymfloat.bin")
#> File:               Rsymfloat.bin
#> Matrix type:        SymmetricMatrix
#> Number of elements: 49 (28 really stored)
#> Data type:          float
#> Endianness:         little endian (same as this machine)
#> Number of rows:     7
#> Number of columns:  7
#> Metadata:           Stored only names of rows.
#> Metadata comment:  "Symmetric matrix of floats"

Notice that if you store an R matrix which is NOT symmetric as a symmetric jmatrix, only the lower triangular part (including the main diagonal) will be saved. The upper-triangular part will be lost.
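Both points (how many values are really stored, and what is lost for a non-symmetric input) can be illustrated in base R. The sketch below works on in-memory matrices only and mimics the effect described; it does not call the package, and the small 3x3 matrix is an arbitrary example.

```r
# Number of values in the lower triangle (including the diagonal) of an n x n matrix
n <- 7
stored <- n * (n + 1) / 2                 # 28, as reported by JMatInfo
data_end <- 128 + stored * 4              # 240 with 4-byte floats and a 128-byte header

# A small matrix that is NOT symmetric
m <- matrix(1:9, nrow = 3)

# Keep only the lower triangle and mirror it, which is what remains
# if the upper triangle is discarded
rebuilt <- m
rebuilt[upper.tri(rebuilt)] <- t(m)[upper.tri(m)]

m[1, 2]        # original upper-triangular value: 4
rebuilt[1, 2]  # replaced by the mirrored lower value: 2
```

The rebuilt matrix is symmetric, but the original upper-triangular values of m cannot be recovered from it.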

The functions to read rows/columns described before work equally, independently of the matrix character (full, sparse or symmetric), so you can play with them using the Rspafloat.bin and Rsymfloat.bin files to check that they work.

If the jmatrix stored in a binary file has names associated with its rows or columns, you can filter it using them and generate another jmatrix file with only the rows or columns you wish to keep. The function to do so is FilterJMatByName.

Rns<-matrix(runif(49),nrow=7)
rownames(Rns)<-c("A","B","C","D","E","F","G")
colnames(Rns)<-c("a","b","c","d","e","f","g")
# Let's see the matrix
Rns
#>            a         b          c         d          e         f         g
#> A 0.31021769 0.3593303 0.98829232 0.9253817 0.18590674 0.5768363 0.8441831
#> B 0.23485317 0.9915758 0.83992465 0.7763845 0.55670597 0.2598939 0.5588117
#> C 0.30524976 0.5852155 0.21496642 0.7879748 0.57089967 0.5254845 0.5747068
#> D 0.89363911 0.4487132 0.06305535 0.2217450 0.04212866 0.4336528 0.6390827
#> E 0.85171313 0.9734071 0.79525294 0.8950884 0.54799849 0.7841066 0.6538280
#> F 0.07688849 0.3062170 0.15616117 0.9416905 0.14402383 0.8818467 0.3853623
#> G 0.83164201 0.4677084 0.64806237 0.9006757 0.83035942 0.8629806 0.8469882

# Write the matrix as full with type float
JWriteBin(Rns,"Rfullfloat.bin",dtype="float",dmtype="full",comment="Full matrix of floats")
#> The passed matrix has row names for the 7 rows and they will be used.
#> The passed matrix has column names for the 7 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (7x7)
#> End of block of binary data at offset 324
#>    Writing row names (7 strings written, from A to G).
#>    Writing column names (7 strings written, from a to g).
#>    Writing comment: Full matrix of floats

# Extract the first two and the last two columns
FilterJMatByName("Rfullfloat.bin",c("a","b","f","g"),"Rfullfloat_fourcolumns.bin",namesat="cols")
#> Read full matrix with size (7,7)
#> 4 columns of the 7 in the original matrix will be kept.
#> Writing binary matrix Rfullfloat_fourcolumns.bin of (7x4)
#> End of block of binary data at offset 240
#>    Writing row names (7 strings written, from A to G).
#>    Writing column names (4 strings written, from a to g).
#>    Writing comment: Full matrix of floats

# Let's load the matrix and let's see it
vm<-GetJManyRows("Rfullfloat_fourcolumns.bin",c(1,7))
vm
#>           a         b         f         g
#> A 0.3102177 0.3593303 0.5768363 0.8441831
#> G 0.8316420 0.4677084 0.8629806 0.8469882

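For a matrix small enough to be held in memory, the result of filtering by column names can be cross-checked against ordinary R indexing. The sketch below is such an in-memory analogue, using an arbitrary seed and matrix; it does not call FilterJMatByName and makes no claim about the order in which the package writes the kept columns.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
m <- matrix(runif(49), nrow = 7,
            dimnames = list(LETTERS[1:7], letters[1:7]))

keep <- c("a", "b", "f", "g")

# In-memory analogue of keeping only the named columns
sub <- m[, keep, drop = FALSE]

dim(sub)                       # 7 rows, 4 columns
sub[c("A", "G"), ]             # first and last rows of the filtered matrix

# Data block size of a 7x4 float matrix after a 128-byte header
data_end <- 128 + nrow(sub) * ncol(sub) * 4   # 240
```

The 240 obtained here agrees with the offset logged when the filtered 7x4 file was written.
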
Domingo, Juan. 2023a. Applying Partitioning Around Medoids to Single Cell Data with High Number of Cells.
Domingo, Juan. 2023b. Jmatrix: Read from/Write to Disk Matrices with Any Data Type in a Binary Format.
Domingo, Juan. 2023c. Parallelpam: Applies the Partitioning-Around-Medoids (PAM) Clustering Algorithm to Big Sets of Data Using Parallel Implementation, If Several Cores Are Available.
Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.
Schmidt, Drew. 2022. “float: 32-Bit Floats.” https://cran.r-project.org/package=float.
