

Introduction to the hdf5r package

Holger Hoefling

April 17th 2016

Abstract

An overview of how to use the simple as well as the advanced facilities of HDF5 using the hdf5r package


1 Introduction

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.

As R is very often used to process large amounts of data, having a direct interface to HDF5 is very useful. As of the writing of this vignette, there are two other packages available that also implement an interface to HDF5: h5 on CRAN and rhdf5 on Bioconductor. These are also good implementations, but several points make this package – hdf5r – stand out.

In the following sections of this vignette, first a simple example will be given that shows how standard operations are performed. Next, more advanced features will be discussed, such as the creation of complex datatypes, datasets with special datatypes, and the setting of the various available filters when reading/writing a dataset. We will end with a technical overview of the underlying implementation.


2 A simple example

As an introduction on how to use it, let us set up a very simple usage example. We will create a file and some groups in it, as well as datasets of different sizes. We will read and write data, delete datasets again, and get information on various objects.

2.1 Creating files, groups and datasets

But first things first. We create a random filename in a temporary directory and create a file with read/write access, deleting it if it already exists (it won't; tempfile gives us the name of a file that doesn't exist yet).

library(hdf5r)
test_filename <- tempfile(fileext = ".h5")
file.h5 <- H5File$new(test_filename, mode = "w")
file.h5
## Class: H5File
## Filename: /tmp/RtmpmUJ35B/file33f4917e9f5a7.h5
## Access type: H5F_ACC_RDWR

Now that we have this, we will create 2 groups, one for the mtcars dataset and one for the nycflights13 dataset.

mtcars.grp <- file.h5$create_group("mtcars")
flights.grp <- file.h5$create_group("flights")

Into these groups, we will now write the datasets

library(datasets)
library(nycflights13)
library(reshape2)
mtcars.grp[["mtcars"]] <- datasets::mtcars
flights.grp[["weather"]] <- nycflights13::weather
flights.grp[["flights"]] <- nycflights13::flights

Out of the weather data, we extract the information on the wind direction and wind speed and will save it as a matrix with the hours in the columns and the days in the rows (only for weather station EWR; the others are not complete).

weather_wind_dir <- subset(nycflights13::weather, origin == "EWR",
    select = c("year", "month", "day", "hour", "wind_dir"))
weather_wind_dir <- na.exclude(weather_wind_dir)
weather_wind_dir$wind_dir <- as.integer(weather_wind_dir$wind_dir)
weather_wind_dir <- acast(weather_wind_dir, year + month + day ~ hour, value.var = "wind_dir")
## Aggregation function missing: defaulting to length
flights.grp[["wind_dir"]]<- weather_wind_dir

and

weather_wind_speed <- subset(nycflights13::weather, origin == "EWR",
    select = c("year", "month", "day", "hour", "wind_speed"))
weather_wind_speed <- na.exclude(weather_wind_speed)
weather_wind_speed <- acast(weather_wind_speed, year + month + day ~ hour, value.var = "wind_speed")
## Aggregation function missing: defaulting to length
flights.grp[["wind_speed"]]<- weather_wind_speed

For completeness, we also attach the row and column names as attributes:

h5attr(flights.grp[["wind_dir"]], "colnames") <- colnames(weather_wind_dir)
h5attr(flights.grp[["wind_dir"]], "rownames") <- rownames(weather_wind_dir)
h5attr(flights.grp[["wind_speed"]], "colnames") <- colnames(weather_wind_speed)
h5attr(flights.grp[["wind_speed"]], "rownames") <- rownames(weather_wind_speed)

2.2 Getting information about different objects

2.2.1 Content of files and groups

With respect to groups and files, we also want to have a simple way to extract the contents. With the names function, we can get all names of objects in a group or in the root directory of a file

names(file.h5)
## [1] "flights" "mtcars"
names(flights.grp)
## [1] "flights"    "weather"    "wind_dir"   "wind_speed"

Another option that gives more information is ls, a method of the classes H5File and H5Group

flights.grp$ls()
##         name     link.type    obj_type num_attrs group.nlinks group.mounted
## 1    flights H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 2    weather H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 3   wind_dir H5L_TYPE_HARD H5I_DATASET         2           NA            NA
## 4 wind_speed H5L_TYPE_HARD H5I_DATASET         2           NA            NA
##   dataset.rank dataset.dims dataset.maxdims dataset.type_class
## 1            1       336776             Inf       H5T_COMPOUND
## 2            1        26115             Inf       H5T_COMPOUND
## 3            2     364 x 24       Inf x Inf        H5T_INTEGER
## 4            2     364 x 24       Inf x Inf        H5T_INTEGER
##   dataset.space_class committed_type
## 1          H5S_SIMPLE           <NA>
## 2          H5S_SIMPLE           <NA>
## 3          H5S_SIMPLE           <NA>
## 4          H5S_SIMPLE           <NA>

2.2.2 Information on attributes, datatypes and datasets

If you have an HDF5 file, it is of course important to be able to look up various information, not only about groups, but also about the other objects contained in it. First, we want to get more information about a dataset. ls on the group already gives a lot of information about the datatype, the size, the maximum size etc. However, there are also other, more direct, ways to get the same information. In order to investigate the datatype we can do

weather_ds <- flights.grp[["weather"]]
weather_ds_type <- weather_ds$get_type()
weather_ds_type$get_class()
## [1] H5T_COMPOUND
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11
cat(weather_ds_type$to_text())
## H5T_COMPOUND {
##       H5T_STRING {
##          STRSIZE H5T_VARIABLE;
##          STRPAD H5T_STR_NULLTERM;
##          CSET H5T_CSET_ASCII;
##          CTYPE H5T_C_S1;
##       } "origin" : 0;
##       H5T_STD_I32LE "year" : 8;
##       H5T_STD_I32LE "month" : 12;
##       H5T_STD_I32LE "day" : 16;
##       H5T_STD_I32LE "hour" : 20;
##       H5T_IEEE_F64LE "temp" : 24;
##       H5T_IEEE_F64LE "dewp" : 32;
##       H5T_IEEE_F64LE "humid" : 40;
##       H5T_IEEE_F64LE "wind_dir" : 48;
##       H5T_IEEE_F64LE "wind_speed" : 56;
##       H5T_IEEE_F64LE "wind_gust" : 64;
##       H5T_IEEE_F64LE "precip" : 72;
##       H5T_IEEE_F64LE "pressure" : 80;
##       H5T_IEEE_F64LE "visib" : 88;
##       H5T_IEEE_F64LE "time_hour" : 96;
##    }

telling us that our dataset consists of an H5T_COMPOUND datatype and printing more detailed information on the content of every column. Regarding the size of the dataset and the size of the chunks (datasets are by default chunked; more about this below), we do:

weather_ds$dims
weather_ds$maxdims
weather_ds$chunk_dims
## [1] 26115
## [1] Inf
## [1] 78

In order to get information on attributes, we also have various functions available. We can see which attributes are attached to an object with

h5attr_names(flights.grp[["wind_dir"]])
## [1] "colnames" "rownames"

and the content of one attribute can be extracted with h5attr; the content of all of them, as a list, with h5attributes.

h5attr(flights.grp[["wind_dir"]], "colnames")
##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"## [16] "15" "16" "17" "18" "19" "20" "21" "22" "23"

2.2.3 Detailed information about various objects

In HDF5, there are also various ways of getting more detailed information about objects. The most detailed methods for this are:

  • get_obj_info: Various information on the number of attributes, the type of the object, the reference count, access times (if set to be recorded) and other more technical information
  • get_link_info: For links, mainly yields information on the link type, i.e. hard link or soft link. The difference between them and how to create them will be discussed further below.
  • get_group_info: Information about the storage type of the group, whether a file is mounted to the group and the number of items in the group. For the casual user, the most interesting information is the number of items in the group, which can also be retrieved using the names function. For very large groups, this way is however more efficient.
  • get_file_name: For an H5File, H5Group, H5D or H5T (where D is for dataset and T stands for a committed type) object, returns the name of the file it is in.
  • get_obj_name: Similar to get_file_name, applies to the same objects, but returns the path inside the file to the object
  • file_info: Extracts relatively technical information about a file. It can only be applied to an object of class H5File. This function is usually not of interest to the casual user

Most of these are somewhat advanced. The key information can usually also be extracted with one of the "higher-level" methods shown above, but sometimes the info methods are more efficient.
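
As a small illustration of two of these methods, a sketch only; it assumes the weather_ds handle from above is still open and the exact output depends on your session:

weather_ds$get_obj_name()   # path of the dataset inside the file (here "/flights/weather")
weather_ds$get_file_name()  # name of the HDF5 file the dataset lives in
weather_ds$get_obj_info()   # the more technical object information described above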

2.3 Assigning data into datasets and deleting datasets

Of course we also want to be able to read out data, change it, extend the dataset and also delete it again. Reading out the data works just as it does for regular R arrays and data frames. However, HDF5 tables only have one dimension, not two. It is currently not possible to selectively read columns; all of them have to be read at the same time. For arrays, any data point can be read on its own without restrictions

weather_ds[1:5]
##   origin year month day hour  temp  dewp humid wind_dir wind_speed wind_gust
## 1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702        NA
## 2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546        NA
## 3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780        NA
## 4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858        NA
## 5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858        NA
##   precip pressure visib  time_hour
## 1      0   1012.0    10 1357020000
## 2      0   1012.3    10 1357023600
## 3      0   1012.5    10 1357027200
## 4      0   1012.2    10 1357030800
## 5      0   1011.9    10 1357034400
wind_dir_ds <- flights.grp[["wind_dir"]]
wind_dir_ds[1:3, ]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    1    1    1    1    1    1    1    1     1     1     1     0     1
## [2,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
## [3,]    1    1    1    1    1    1    1    1    1     1     1     0     1     1
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]     1     1     1     1     1     1     1     1     1     1
## [2,]     1     1     1     1     1     1     1     1     1     1
## [3,]     1     1     1     1     1     1     1     1     1     1

Let us replace one row. Currently, vector recycling is not enabled, so you have to ensure that your replacements have the correct size. Recycling may be enabled in the future.

wind_dir_ds[1, ] <- rep(1, 24)
wind_dir_ds[1, ]
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

It is also possible to add data outside the dimensions of the dataset as long as they are within the maxdims. The dataset will be expanded to accommodate the new data. When the expansion of the dataset leads to unassigned points, they are filled with the default fill value. The default fill value can be obtained using

wind_dir_ds$get_fill_value()
## [1] 0
wind_dir_ds[1, 25] <- 1
wind_dir_ds[1:2, ]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
## [2,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,]     1     1     1     1     1     1     1     1     1     1     1
## [2,]     1     1     1     1     1     1     1     1     1     1     0

Now that we have expanded the dataset to have a 25th column, filled with 0s except for the first row, it only remains to show how to delete a dataset. Note, however: deleting a dataset does not lead to a reduction in HDF5 file size, but the internal space can be re-used for other datasets later.

flights.grp$link_delete("wind_dir")
flights.grp$ls()
##         name     link.type    obj_type num_attrs group.nlinks group.mounted
## 1    flights H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 2    weather H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 3 wind_speed H5L_TYPE_HARD H5I_DATASET         2           NA            NA
##   dataset.rank dataset.dims dataset.maxdims dataset.type_class
## 1            1       336776             Inf       H5T_COMPOUND
## 2            1        26115             Inf       H5T_COMPOUND
## 3            2     364 x 24       Inf x Inf        H5T_INTEGER
##   dataset.space_class committed_type
## 1          H5S_SIMPLE           <NA>
## 2          H5S_SIMPLE           <NA>
## 3          H5S_SIMPLE           <NA>

2.4 Closing the file

As a last step, we want to close the file. For this, we have 2 options, the close and close_all methods of an H5File. There are some non-obvious differences between the two for novice users. close will close the file, but groups and datasets that are already open will stay open. Furthermore, as long as any object is still open, the file cannot be re-opened in the regular fashion, as HDF5 prevents a file from being opened more than once.

However, it can be quite cumbersome to close all objects associated with a file, that is, if we even still have access to them. We may have created an object, discarded it, but the garbage collector hasn't closed it yet.

In order to make this process simpler for the end user, close_all closes the file as well as all objects associated with the file. Any R6 classes pointing to these objects will automatically be invalidated. This way, if needed, the file can be re-opened again.

file.h5$close_all()

As a rule, it is recommended to work in the following fashion: open a file with H5File$new and store the resulting R6-class object. Do not discard this object. The current default behavior is to close the file, but not the objects inside the file, when the garbage collector is triggered. This is done in order not to interfere with other open objects, but as explained it can prevent the re-opening of the file later. Therefore, do not discard the R6 class pointing to a file, and close it later using the close_all method to ensure that all IDs using the file are closed as well.


3 Advanced features

HDF5 provides a very wide range of tools. Describing all of them would certainly be a task that is too large for this vignette. For a complete overview of what HDF5 can do, the reader should have a look at the HDF5 website and the documentation that is listed there, in particular the reference manual. Most API functions that are referenced there are already implemented (and any missing functionality that is feasible will hopefully follow soon).

In this section we will therefore only shine a spotlight on a number of low-level API functions that can be used in connection with creating datasets as well as datatypes.

3.1 Creating datasets

As we have already seen above, a dataset can be created by simply assigning an appropriate R object under a given name into a group or a file. The automatic algorithm then uses the size of the assigned object to determine the size of the HDF5 dataset; it makes assumptions about "chunking" that influence the storage efficiency as well as the maximum possible size of the dataset.

However, we have much more control if we specify these things "by hand". In the following example, we will create a dataset consisting of 2-bit unsigned integers (i.e. capable of storing values from 0 to 3). We will set the size of the dataset as well as the space and the chunk size ourselves. As a first step, let's create the custom datatype

uint2_dt <- h5types$H5T_NATIVE_UINT32$set_size(1)$set_precision(2)$set_sign(h5const$H5T_SGN_NONE)

Here we use a built-in constant and datatype. All constants can be accessed using h5const$<constant_name> and all built-in types are accessed with h5types$<type_name>. An overview of all existing constants can be retrieved with h5const$overview and all existing types are shown by h5types$overview.
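
For example, to browse what is available, a minimal sketch (both accessors return a listing of the available names):

h5types$overview   # all built-in datatypes, e.g. the H5T_NATIVE_UINT32 used above
h5const$overview   # all built-in constants, e.g. the H5T_SGN_NONE used above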

Next we define the space that we will use for the dataset, where we want 10 columns and 10 rows. The number of columns will always be fixed, but the number of rows should be able to increase to infinity.

space_ds <- H5S$new(dims = c(10, 10), maxdims = c(Inf, 10))

Next, we have to define with which properties the dataset should be created. We will set a default fill value of 1, enable n-bit filtering but no compression, and set the chunk size to (10, 10).

ds_create_pl_nbit <- H5P_DATASET_CREATE$new()
ds_create_pl_nbit$set_chunk(c(10, 10))$set_fill_value(uint2_dt, 1)$set_nbit()

Now let's put all this together and create a dataset.

uint2.grp <- file.h5$create_group("uint2")
uint2_ds_nbit <- uint2.grp$create_dataset(name = "nbit_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_nbit,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_nbit[, ] <- sample(0:3, size = 100, replace = TRUE)
uint2_ds_nbit$get_storage_size()
## [1] 26

And now let's compare what happens if we don't have any filter, only compression, and n-bit as well as compression

ds_create_pl_nbit_deflate <- ds_create_pl_nbit$copy()$set_deflate(9)
ds_create_pl_deflate <- ds_create_pl_nbit$copy()$remove_filter()$set_deflate(9)
ds_create_pl_none <- ds_create_pl_nbit$copy()$remove_filter()
uint2_ds_nbit_deflate <- uint2.grp$create_dataset(name = "nbit_deflate_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_nbit_deflate,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_nbit_deflate[, ] <- uint2_ds_nbit[, ]
uint2_ds_deflate <- uint2.grp$create_dataset(name = "deflate_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_deflate,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_deflate[, ] <- uint2_ds_nbit[, ]
uint2_ds_none <- uint2.grp$create_dataset(name = "none_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_none,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_none[, ] <- uint2_ds_nbit[, ]

With the sizes of the datasets

uint2_ds_nbit_deflate$get_storage_size()
uint2_ds_nbit$get_storage_size()
uint2_ds_deflate$get_storage_size()
uint2_ds_none$get_storage_size()
## [1] 35
## [1] 26
## [1] 55
## [1] 100

and we see that in the case of random data, not surprisingly, the n-bit filter alone is the most efficient. Using compression on top of the n-bit filter actually increases the storage size. However, despite the random data, compression can still save some space compared to raw storage, as in raw storage mode a whole byte is stored per value and not just 2 bits.

3.2 Interacting with datatypes

3.2.1 Integer, Float

For integer datatypes we have already seen that we have control over essentially everything, i.e. signed/unsigned as well as precision down to the exact number of bits. For floats we have similar control, being able to customize the size of the mantissa as well as the exponent (although in practice this is likely less relevant than being able to customize integer types). To learn more about this functionality for floats, we recommend reading the relevant section of the HDF5 reference manual.
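
As a small read-only sketch of this, we can inspect the layout of the built-in 64-bit float. Note that get_fields and get_ebias are assumed here to mirror the C functions H5Tget_fields and H5Tget_ebias; check the H5T_FLOAT documentation of your hdf5r version.

dbl_dt <- h5types$H5T_NATIVE_DOUBLE
dbl_dt$get_size()        # storage size in bytes (8)
dbl_dt$get_precision()   # precision in bits (64)
dbl_dt$get_fields()      # assumed: positions/sizes of the sign, exponent and mantissa fields
dbl_dt$get_ebias()       # assumed: the exponent bias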

3.2.2 Strings

HDF5 itself provides access to both C-type strings and FORTRAN-type strings. As R internally uses C strings, only C-type strings are supported (i.e. strings that are NULL-terminated). In terms of the size of the strings, there are fixed and variable length strings available.

str_fixed_len <- H5T_STRING$new(size = 20)
str_var_length <- H5T_STRING$new(size = Inf)

These two types of strings have implications for efficiency and usability. For obvious reasons, variable length strings are more convenient, as they are never too small to hold a piece of information. However, internally in HDF5, they aren't stored in the dataset itself; only a pointer to the HDF5-internal heap is stored. This has 2 implications:

  • Retrieving the string is somewhat slower
  • As the heap is not compressed, compression of datasets does not yield much space saving for variable length data

From this perspective, fixed length strings are considerably better, as they are both faster (if not too long) and compressible. However, the user has to be careful that their strings aren't getting too long, or they will be truncated.
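
A minimal sketch of this truncation behavior, re-using the fixed-length type from above (the group and dataset names are made up for illustration, and file.h5 is assumed to still be open):

str.grp <- file.h5$create_group("string_demo")
str_ds <- str.grp$create_dataset(name = "fixed", space = H5S$new(dims = 2),
    dtype = str_fixed_len)
str_ds[1:2] <- c("short", strrep("x", 40))
str_ds[1:2]   # the second element is expected to come back cut off after 20 characters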

3.2.3 Enum

The equivalent of factors in R are ENUM datatypes. These are stored internally as integers, but each integer has a string label attached to it. In contrast to R factor variables, the integer values do not have to start at 1 and do not have to be consecutive either. In order to also support this more flexible datatype optimally on the R side, hdf5r comes with the factor_ext class. In the HDF5 API, each enum level is inserted one at a time. As this is rather inconvenient for a vector-oriented language like R, this functionality has not been exposed. We instead provide an R6-class constructor that lets us set all labels and values in one go.

enum_example <- H5T_ENUM$new(c("Label 1", "Label 2", "Label 3"), values = c(-3, 5, 10))

For efficiency reasons, an integer datatype is automatically generated that provides exactly the precision needed to store the values of the enum. Given an enum variable, we can also find out what labels and values it has

enum_example$get_labels()
enum_example$get_values()
## [1] "Label 1" "Label 2" "Label 3"## [1] -3  5 10

In addition, we can also get back the datatype that the enum is based on

enum_example$get_super()
## Class: H5T_INTEGER
## Datatype: undefined integer

3.2.3.1 Logical values

A logical variable is a special case of an enum. It is internally based on a 1-byte unsigned integer that has a precision of 1 bit (so an n-bit filter will only store a single bit). Its internal values are 0 and 1 with labels FALSE and TRUE, respectively. As a class, it is represented as an H5T_ENUM

logical_example <- H5T_LOGICAL$new(include_NA = TRUE)
## we could also use h5types$H5T_LOGICAL or h5types$H5T_LOGICAL_NA
logical_example$get_labels()
logical_example$get_values()
## [1] "FALSE" "TRUE"  "NA"   ## [1] 0 1 2

Note that doLogical has precedence over the labels parameter.

3.2.4 Compounds (Tables)

Tables are represented as COMPOUND HDF5 objects, which are the equivalent of a C struct. As R does not know this datatype natively, it has to be converted from structs to the list-based structure of R data frames. Similar to ENUMs, we don't expose the underlying C API that builds the compound one element at a time, but instead provide constructors that create it in one go.

cpd_example <- H5T_COMPOUND$new(c("Double_col", "Int_col", "Logical_col"),
    dtypes = list(h5types$H5T_NATIVE_DOUBLE, h5types$H5T_NATIVE_INT, logical_example))

and similar to enums, we can also get back the column names, the classes of the datatypes as well as identifiers for the datatypes themselves.

cpd_example$get_cpd_labels()
## [1] "Double_col"  "Int_col"     "Logical_col"
cpd_example$get_cpd_classes()
## [1] H5T_FLOAT   H5T_INTEGER H5T_ENUM
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11
cpd_example$get_cpd_types()
## [[1]]
## Class: H5T_FLOAT
## Datatype: H5T_IEEE_F64LE
## 
## [[2]]
## Class: H5T_INTEGER
## Datatype: H5T_STD_I32LE
## 
## [[3]]
## Class: H5T_LOGICAL
## Datatype: H5T_ENUM {
##       undefined integer;
##       "FALSE"            0;
##       "TRUE"             1;
##       "NA"               2;
##    }

A textual description is also available

cat(cpd_example$to_text())
## H5T_COMPOUND {
##       H5T_IEEE_F64LE "Double_col" : 0;
##       H5T_STD_I32LE "Int_col" : 8;
##       H5T_ENUM {
##          undefined integer;
##          "FALSE"            0;
##          "TRUE"             1;
##          "NA"               2;
##       } "Logical_col" : 12;
##    }

3.2.4.1 Complex values

We also have a way of representing complex variables; these are a compound object consisting of two double precision floating point columns. This also matches nicely the fact that internally in R, complex values are represented as a struct of doubles.

cplx_example <- H5T_COMPLEX$new()
cplx_example$get_cpd_labels()
cplx_example$get_cpd_classes()
## [1] "Real"      "Imaginary"## [1] H5T_FLOAT H5T_FLOAT## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES## 13 Values: -1 0 1 2 ... 11

3.2.5 Arrays

A special datatype is the H5T_ARRAY. As datasets are themselves arrays, this type is not needed to represent arrays as such. Rather, it is useful in cases where one datatype is wrapped inside another, mainly if a column of a compound object is supposed to be an array. So let's create an array and put it into a compound object together with some other columns

array_example <- H5T_ARRAY$new(dims = c(3, 4), dtype_base = h5types$H5T_NATIVE_INT)
cpd_several <- H5T_COMPOUND$new(c("STRING_fixed", "Double", "Complex", "Array"),
    dtypes = list(str_fixed_len, h5types$H5T_NATIVE_DOUBLE, cplx_example, array_example))
cat(cpd_several$to_text())
## H5T_COMPOUND {
##       H5T_STRING {
##          STRSIZE 20;
##          STRPAD H5T_STR_NULLTERM;
##          CSET H5T_CSET_ASCII;
##          CTYPE H5T_C_S1;
##       } "STRING_fixed" : 0;
##       H5T_IEEE_F64LE "Double" : 20;
##       H5T_COMPOUND {
##             H5T_IEEE_F64LE "Real" : 0;
##             H5T_IEEE_F64LE "Imaginary" : 8;
##          } "Complex" : 28;
##       H5T_ARRAY {
##          [4][3] H5T_STD_I32LE
##       } "Array" : 44;
##    }

And to see what this would look like as an R object

obj_empty <- create_empty(1, cpd_several)
obj_empty
## Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
## corrupt data frame: columns will be truncated or padded with NAs
obj_empty$Array
##   STRING_fixed Double Complex Array
## 1                   0    0+0i     0
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0

3.2.6 Variable length data types

And last, there are also variable length datatypes, corresponding to a list in R where each item of the list has the same datatype (a general R list, where each item can have a different type, cannot be represented in HDF5).

vlen_example <- H5T_VLEN$new(dtype_base = cpd_several)

This would represent a list where each item is a table with an arbitrary number of rows.
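
As a simpler, self-contained sketch of writing such a type, here with a plain double base type instead of the compound above (the dataset name is made up and file.h5 is assumed to be open):

vlen_dbl <- H5T_VLEN$new(dtype_base = h5types$H5T_NATIVE_DOUBLE)
vlen_ds <- file.h5$create_dataset(name = "vlen_demo", space = H5S$new(dims = 3),
    dtype = vlen_dbl)
vlen_ds[1:3] <- list(c(1.5, 2), c(3, 4, 5), 6)
vlen_ds[1:3]   # expected: a list of numeric vectors of lengths 2, 3 and 1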


4 Implementation details

In this section, some of the details will be discussed that are likely only interesting for the technically inclined or for someone who wants to extend the package itself.

4.1 Closing of unused ids and garbage collection

In this package, the C API of HDF5 is being used. For the C API, it is usually the programmer's responsibility to manually close an HDF5 ID that is in use by calling the appropriate "close" function. If programs are not written very diligently, this can easily lead to memory leaks.

As users of R are used to objects being automatically garbage-collected, such a behavior could pose a significant problem in R. In order to avoid any issues, the closing of HDF5 IDs is therefore done automatically using the R garbage collection mechanism.

For every ID that is created in the C code and passed back to R, an R6-class object is created that is non-cloneable. During creation, a finalizer (see reg.finalizer) is set so that during garbage collection of the R6-class object, or when shutting down R, the corresponding HDF5 resources are released.

In addition to this, all HDF5 IDs that are currently in use are also being tracked (in the obj_tracker environment; not exported). The reason for this separate tracking is so that, on demand, all objects that are currently still open in a file can be closed. The special challenge here is on the one hand to track every R6 object that is in use in R, and at the same time not interfere with the normal operation of the R garbage collection mechanism. To this end, we cannot just save the environment itself in the obj_tracker (note that in R, an environment object is always a pointer to the environment, not the whole environment itself). If we stored a pointer to the environment itself, the R garbage collector would never delete the environment, as formally it would still be in use (in the obj_tracker). In order to prevent that, a different mechanism was implemented.

As mentioned, this was mainly implemented to allow for the closing of all IDs that are still open inside a file and to invalidate all existing R6 classes as well.

4.1.1 Opening and closing of files

In this context, let us quickly also discuss the special way HDF5 handles files. In HDF5, in principle a file can only be opened once. This can lead to problems, as users in R are used to being able to open files as often as they like. Furthermore, it is possible in HDF5 to close the ID of a file without closing all objects in the file. Then, however, the file actually stays open until the last ID pointing into the file is closed, and it cannot be opened again until then.

Therefore, as already explained above (and as recommended by the HDF5 manual), do not discard or close files that still have open objects in them. It is preferable to keep the HDF5 file ID around and close it (and all objects inside the file) when it is no longer needed, using the close_all method.

4.2 Conversion of datatypes

A special feature of this package is the far-reaching and flexible implementation of data-conversion routines between R and HDF5. Routines have been implemented for all datatypes: strings, data frames, arrays and variable length (HDF5 VLEN) objects. Some are relatively straightforward, others are more complicated. Here, numeric datatypes can be tricky due to the limited ability of R to represent certain datatypes, specifically long doubles or 64-bit integers.

4.2.1 Numeric datatypes

For numeric datatypes, the situation is in certain circumstances a bit tricky. In general, R numerical objects are either represented as 64-bit floating point values (doubles) or 32-bit integers. R switches relatively transparently between these types as needed (for computations, integers are converted to doubles and, conversely, array positions can be addressed by doubles). The main issue when working with HDF5 is that R has neither a 64-bit signed or unsigned integer datatype nor a long double. In order to work around this issue, the following conventions are being used

  • The package uses the bit64 package to provide support for 64-bit integers. These are used extensively (e.g. for IDs) and also for numeric integer data types (see the short sketch after this list).
  • 32-bit and 64-bit floats from HDF5 are always returned as 64-bit floats in R. Writing 32-bit floats from R may incur a loss of precision relative to the underlying 64-bit double that is used to represent them in R.
  • For integer data types, any HDF5 integer type that can accurately be represented as a 32-bit signed integer will be returned to R as a regular integer (this can be changed using flags). Any HDF5 64-bit integer can be returned as a signed 64-bit integer, with the option of returning it as a 32-bit integer or double if this can be done without loss of precision. Unsigned 64-bit integers will be returned as floats, incurring a loss of precision but avoiding truncation.
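
A small sketch of the integer64 round trip via the bit64 package (the dataset name is made up; file.h5 is assumed to be open):

library(bit64)
big_int <- as.integer64(c(2^40, 17, -3))
file.h5[["int64_demo"]] <- big_int
file.h5[["int64_demo"]][]
## whether this comes back as integer64, integer or double depends on the
## conversion flags and the loss-of-precision rules described above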

An overview of how the data conversion is being done can be seen here:

[Figure: Schematic of datatype conversion]

The underlying principle is that any internal conversion between R types is done by R (with the resulting handling of NAs and overflows), whereas any conversion between R types and non-R types is done by the HDF5 library (usually meaning that on overflow, truncation occurs).

4.2.2 Strings

In HDF5, strings can either be variable length or fixed length. In R, they are always variable length. Therefore, strings written from R into fixed-length HDF5 fields will be truncated if they are too long. Conversely, fixed-length strings read from HDF5 into R will only be returned up to the NULL character that terminates strings in C.

4.2.3 Data-frames/Compounds

The situation is a bit more tricky for table-like objects. In R, these are data frames, which internally are a list of vectors. In HDF5, a table is a COMPOUND object, which is equivalent to a C struct, i.e. every row is stored together, whereas in R every column is stored together. Each of these approaches has certain advantages, but the challenge here is to translate between them.

This is done in a straightforward manner. When converting from R to HDF5, the columns of the table are copied into the structs, whereas in the reverse direction, every struct is decomposed into its corresponding columns.

The Data-frame <-> Compound conversion is also extensively used for HDF5 API functions that return structs as a result (and therefore return data-frames).

4.2.4 Array data types

In HDF5, datasets themselves can have arbitrary dimensions. In addition to that, there are also array datatypes that allow for the inclusion of arrays e.g. inside a compound object. Translation to and from arrays is relatively straightforward and only involves setting the correct dim attribute in R.
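
A minimal round-trip sketch for a plain (non-compound) array dataset (the dataset name is made up; file.h5 is assumed to be open):

arr <- array(seq_len(24), dim = c(2, 3, 4))
file.h5[["array_demo"]] <- arr
dim(file.h5[["array_demo"]][, , ])   # expected: 2 3 4, i.e. the original dim attribute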

In addition to that, however, there is a small complication. In R, the first dimension is the fastest changing dimension. In HDF5 (same as in C), the last dimension is the fastest changing one. For datasets, we work around this problem by always reversing the dimensions that are passed between R and HDF5, thereby making the distinction transparent. For arrays, this is however a bit trickier. For example, let us assume that we have a dataset that is a one-dimensional vector of length 10, each element of which is an array datatype of length 4, resulting in a 10 x 4 dataset. However, it is now not quite clear how this should be represented in R. If we follow the notion that the fastest changing dimension in R is the first one, the result would be a dataset with 4 rows and 10 columns, i.e. 4 x 10.

This does feel rather unintuitive, forcing a user to specify the second dimension to get all items of the array. Therefore, we have implemented it so that a 10 x 4 dataset is returned, with each row corresponding to the array datatype. In order to achieve this, we have to deviate from the ordering principle of HDF5. Where in HDF5 the elements of the first internal array are in positions 1, 2, 3 and 4 (or 0 to 3 when you start counting at 0), in R they are now in positions 1, 11, 21 and 31. In order to do this, we first internally read the HDF5 array into an R array of shape 4 x 10 and then transpose the result.

4.2.5 Variable-length data types

In HDF5, there are also variable-length data types. Essentially, this corresponds to an R list-like object, with the additional restriction that every item of the list has to be of the same datatype. This is also how it is implemented: an R list where all items are vectors (of arbitrary length) of the same type can be converted to an HDF5 VLEN object and vice versa.

4.2.6 Reference objects

As of the writing of this vignette, these have not yet been implemented.


5 Future directions

