Abstract
Overview on how to use the simple as well as advanced facilities of HDF5 using the hdf5r package
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.
As R is very often used to process large amounts of data, having a direct interface to HDF5 is very useful. As of the writing of this vignette, there are 2 other packages available that also implement an interface to HDF5, h5 on CRAN and rhdf5 on Bioconductor. These are also good implementations, but there are several points that make this package, hdf5r, stand out:
In the following sections of this vignette, first a simple example will be given that shows how standard operations are performed. Next, more advanced features will be discussed, such as the creation of complex datatypes, datasets with special datatypes, and the setting of the various available filters when reading/writing a dataset. We will end with a technical overview of the underlying implementation.
As an introduction on how to use it, let us set up a very simple usage example. We will create a file, some groups in it as well as datasets of different sizes. We will read and write data, delete datasets again, and get information on various objects.
But first things first. We create a random filename in a temporary directory and create a file with read/write access, deleting it if it already exists (it won’t, as tempfile gives us the name of a file that doesn’t exist yet).
library(hdf5r)
test_filename <- tempfile(fileext = ".h5")
file.h5 <- H5File$new(test_filename, mode = "w")
file.h5
## Class: H5File
## Filename: /tmp/RtmpmUJ35B/file33f4917e9f5a7.h5
## Access type: H5F_ACC_RDWR

Now that we have this, we will create 2 groups, one for the mtcars dataset and one for the nycflights13 dataset.
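The group creation can be sketched as follows. This is a minimal sketch: it uses the create_group method that also appears later in this vignette, and the variable names mtcars.grp and flights.grp are taken from the code that follows. The file is only re-created if file.h5 does not exist yet, so that the snippet also runs on its own.

```r
library(hdf5r)

# Assumption: file.h5 is the H5File object opened above; it is only
# re-created here so that this sketch also runs on its own.
if (!exists("file.h5")) {
  file.h5 <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
}

# One group per dataset family
mtcars.grp <- file.h5$create_group("mtcars")
flights.grp <- file.h5$create_group("flights")
```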
Into these groups, we will now write the datasets
library(datasets)
library(nycflights13)
library(reshape2)
mtcars.grp[["mtcars"]] <- datasets::mtcars
flights.grp[["weather"]] <- nycflights13::weather
flights.grp[["flights"]] <- nycflights13::flights

Out of the weather data, we extract the information on the wind direction and wind speed and save it as a matrix with the hours in the columns and the days in the rows (only for weather station EWR, as the others are not complete).

weather_wind_dir <- subset(nycflights13::weather, origin == "EWR",
                           select = c("year", "month", "day", "hour", "wind_dir"))
weather_wind_dir <- na.exclude(weather_wind_dir)
weather_wind_dir$wind_dir <- as.integer(weather_wind_dir$wind_dir)
weather_wind_dir <- acast(weather_wind_dir, year + month + day ~ hour, value.var = "wind_dir")
## Aggregation function missing: defaulting to length

and

weather_wind_speed <- subset(nycflights13::weather, origin == "EWR",
                             select = c("year", "month", "day", "hour", "wind_speed"))
weather_wind_speed <- na.exclude(weather_wind_speed)
weather_wind_speed <- acast(weather_wind_speed, year + month + day ~ hour, value.var = "wind_speed")
## Aggregation function missing: defaulting to length

For completeness, we also attach the row and column names as attributes:
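Writing a matrix and attaching its dimnames could look like the sketch below. To keep it self-contained it uses a small stand-in matrix and a throwaway file; demo_file, demo_grp, demo_wind and wind_ds are hypothetical names, not objects from this vignette.

```r
library(hdf5r)

# Hypothetical stand-ins for flights.grp and weather_wind_dir
demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_grp <- demo_file$create_group("flights")
demo_wind <- matrix(1:6, nrow = 2,
                    dimnames = list(c("2013_1_1", "2013_1_2"), c("0", "1", "2")))

# Write the matrix as a dataset (the dimnames are not stored automatically)
demo_grp[["wind_dir"]] <- demo_wind

# Attach row and column names as attributes of the dataset
wind_ds <- demo_grp[["wind_dir"]]
h5attr(wind_ds, "colnames") <- colnames(demo_wind)
h5attr(wind_ds, "rownames") <- rownames(demo_wind)
```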
With respect to groups and files, we also want to have a simple way to extract the contents. With the names function, we can get all names of objects in a group or in the root directory of a file
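A minimal sketch of both calls, on a small throwaway file with a similar layout (demo_file and demo_grp are hypothetical names):

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_grp <- demo_file$create_group("flights")
demo_file$create_group("mtcars")
demo_grp[["weather"]] <- data.frame(hour = 0:2, temp = c(39.0, 39.9, 41.0))

names(demo_file)  # objects in the root of the file
names(demo_grp)   # objects inside the group
```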
## [1] "flights" "mtcars"
## [1] "flights"    "weather"    "wind_dir"   "wind_speed"

Another option that gives more information is ls, a method of the classes H5File and H5Group
##         name     link.type    obj_type num_attrs group.nlinks group.mounted
## 1    flights H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 2    weather H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 3   wind_dir H5L_TYPE_HARD H5I_DATASET         2           NA            NA
## 4 wind_speed H5L_TYPE_HARD H5I_DATASET         2           NA            NA
##   dataset.rank dataset.dims dataset.maxdims dataset.type_class
## 1            1       336776             Inf       H5T_COMPOUND
## 2            1        26115             Inf       H5T_COMPOUND
## 3            2     364 x 24       Inf x Inf        H5T_INTEGER
## 4            2     364 x 24       Inf x Inf        H5T_INTEGER
##   dataset.space_class committed_type
## 1          H5S_SIMPLE           <NA>
## 2          H5S_SIMPLE           <NA>
## 3          H5S_SIMPLE           <NA>
## 4          H5S_SIMPLE           <NA>

If you have an HDF5 file, it is of course important to look up various information not only about groups, but also about the information contained in them. First, we want to get more information about the dataset. ls on the group already gives a lot of information about the datatype, the size, the maximum size etc. However, there are also other, more direct, ways to get the same information. In order to investigate the datatype we can do
weather_ds <- flights.grp[["weather"]]
weather_ds_type <- weather_ds$get_type()
weather_ds_type$get_class()
## [1] H5T_COMPOUND
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11
cat(weather_ds_type$to_text())
## H5T_COMPOUND {
##    H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    } "origin" : 0;
##    H5T_STD_I32LE "year" : 8;
##    H5T_STD_I32LE "month" : 12;
##    H5T_STD_I32LE "day" : 16;
##    H5T_STD_I32LE "hour" : 20;
##    H5T_IEEE_F64LE "temp" : 24;
##    H5T_IEEE_F64LE "dewp" : 32;
##    H5T_IEEE_F64LE "humid" : 40;
##    H5T_IEEE_F64LE "wind_dir" : 48;
##    H5T_IEEE_F64LE "wind_speed" : 56;
##    H5T_IEEE_F64LE "wind_gust" : 64;
##    H5T_IEEE_F64LE "precip" : 72;
##    H5T_IEEE_F64LE "pressure" : 80;
##    H5T_IEEE_F64LE "visib" : 88;
##    H5T_IEEE_F64LE "time_hour" : 96;
## }

This tells us that our dataset consists of an H5T_COMPOUND datatype, and the textual description gives more detailed information on the content of every column. Regarding the size of the dataset and the size of the chunks (datasets are by default chunked; more about this below) we do:
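The size, maximum size and chunk size are available as fields of the dataset object. A minimal sketch on a small stand-in table (demo_file and demo_ds are hypothetical names; the chunk size is chosen automatically and will differ from the value shown for the weather data):

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["weather"]] <- data.frame(hour = 0:2, temp = c(39.0, 39.9, 41.0))
demo_ds <- demo_file[["weather"]]

demo_ds$dims        # current size: one entry per dimension
demo_ds$maxdims     # maximum size the dataset can grow to
demo_ds$chunk_dims  # size of one storage chunk
```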
## [1] 26115
## [1] Inf
## [1] 78

In order to get information on attributes, we also have various functions available. Which attributes are attached to an object we can see with
## [1] "colnames" "rownames"

and the content of one attribute can be extracted with h5attr, the content of all of them as a list with h5attributes.
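The three attribute functions can be sketched together on a small stand-in dataset (demo_file and demo_ds are hypothetical names):

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["wind_dir"]] <- matrix(1:6, nrow = 2)
demo_ds <- demo_file[["wind_dir"]]
h5attr(demo_ds, "colnames") <- c("0", "1", "2")
h5attr(demo_ds, "rownames") <- c("2013_1_1", "2013_1_2")

h5attr_names(demo_ds)                    # which attributes exist
h5attr(demo_ds, "colnames")              # content of a single attribute
all_attrs <- h5attributes(demo_ds)       # all attributes as a named list
names(all_attrs)
```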
##  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [16] "15" "16" "17" "18" "19" "20" "21" "22" "23"

In HDF5, there are also various ways of getting more detailed information about objects. The most detailed methods for this are
Most of these are somewhat advanced. The key information can usually also be extracted with one of the “higher-level” methods shown above, but sometimes the info methods are more efficient.
Of course we also want to be able to read out data, change it, extend the dataset and also delete it again. Reading out the data works just as it does for regular R arrays and data frames. However, HDF5 tables only have one dimension, not two. It is currently not possible to selectively read columns; all of them have to be read at the same time. For arrays, any data point can be read on its own without restrictions
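A minimal sketch of both kinds of subsetting, using small stand-in objects (demo_file, tbl_ds and mat_ds are hypothetical names): a table is indexed by row only, a matrix by row and column.

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["weather"]] <- data.frame(hour = 0:4, temp = c(39.0, 39.0, 39.0, 39.9, 39.0))
demo_file[["wind_dir"]] <- matrix(1:6, nrow = 2)

tbl_ds <- demo_file[["weather"]]
mat_ds <- demo_file[["wind_dir"]]

first_rows <- tbl_ds[1:2]  # tables are one-dimensional: all columns come back
first_row <- mat_ds[1, ]   # arrays can be subset along every dimension
```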
##   origin year month day hour  temp  dewp humid wind_dir wind_speed wind_gust
## 1    EWR 2013     1   1    1 39.02 26.06 59.37      270   10.35702        NA
## 2    EWR 2013     1   1    2 39.02 26.96 61.63      250    8.05546        NA
## 3    EWR 2013     1   1    3 39.02 28.04 64.43      240   11.50780        NA
## 4    EWR 2013     1   1    4 39.92 28.04 62.21      250   12.65858        NA
## 5    EWR 2013     1   1    5 39.02 28.04 64.43      260   12.65858        NA
##   precip pressure visib  time_hour
## 1      0   1012.0    10 1357020000
## 2      0   1012.3    10 1357023600
## 3      0   1012.5    10 1357027200
## 4      0   1012.2    10 1357030800
## 5      0   1011.9    10 1357034400
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    0    1    1    1    1    1    1    1    1     1     1     1     0     1
## [2,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
## [3,]    1    1    1    1    1    1    1    1    1     1     1     0     1     1
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]     1     1     1     1     1     1     1     1     1     1
## [2,]     1     1     1     1     1     1     1     1     1     1
## [3,]     1     1     1     1     1     1     1     1     1     1

Let us replace one row. Currently, vector recycling is not enabled, so you have to ensure that your replacements have the correct size. Recycling may be enabled in the future.
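Replacing a row could be sketched like this on a small stand-in matrix (demo_file and mat_ds are hypothetical names). Note that the replacement has exactly one value per column.

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["wind_dir"]] <- matrix(1:6, nrow = 2)
mat_ds <- demo_file[["wind_dir"]]

# No recycling: supply one value for each of the 3 columns
mat_ds[1, ] <- rep(9L, 3)
mat_ds[1, ]
```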
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

It is also possible to add data outside the dimensions of the dataset as long as they are within the maxdims. The dataset will be expanded to accommodate the new data. When the expansion of the dataset leads to unassigned points, they are filled with the default fill value. The default fill value can be obtained using
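A minimal sketch of querying the fill value and growing a dataset by writing outside its current extent. It uses a small stand-in matrix (demo_file and mat_ds are hypothetical names) and assumes, as for the datasets above, that an automatically created dataset has unlimited maxdims.

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["wind_dir"]] <- matrix(1:6, nrow = 2)
mat_ds <- demo_file[["wind_dir"]]

mat_ds$get_fill_value()  # value used for points that were never written

# Writing to column 4 of a 2 x 3 dataset expands it to 2 x 4
mat_ds[1, 4] <- 1L
mat_ds$dims
mat_ds[, 4]              # row 2 of the new column holds the fill value
```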
## [1] 0
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
## [2,]    1    1    1    1    1    1    1    1    1     1     1     1     1     1
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
## [1,]     1     1     1     1     1     1     1     1     1     1     1
## [2,]     1     1     1     1     1     1     1     1     1     1     0

Now that we have expanded the dataset to have a 25th column, filled with 0s except for the first row, it only remains to show how to delete a dataset. Note, however, that deleting a dataset does not lead to a reduction in HDF5 file size, but the internal space can be re-used for other datasets later.
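Deleting a dataset amounts to removing its link from the group or file. A minimal sketch using the link_delete method on a throwaway file (demo_file is a hypothetical name):

```r
library(hdf5r)

demo_file <- H5File$new(tempfile(fileext = ".h5"), mode = "w")
demo_file[["wind_dir"]] <- 1:3
demo_file[["wind_speed"]] <- 4:6

# Remove the link; the space is freed internally, the file does not shrink
demo_file$link_delete("wind_dir")
names(demo_file)
```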
##         name     link.type    obj_type num_attrs group.nlinks group.mounted
## 1    flights H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 2    weather H5L_TYPE_HARD H5I_DATASET         0           NA            NA
## 3 wind_speed H5L_TYPE_HARD H5I_DATASET         2           NA            NA
##   dataset.rank dataset.dims dataset.maxdims dataset.type_class
## 1            1       336776             Inf       H5T_COMPOUND
## 2            1        26115             Inf       H5T_COMPOUND
## 3            2     364 x 24       Inf x Inf        H5T_INTEGER
##   dataset.space_class committed_type
## 1          H5S_SIMPLE           <NA>
## 2          H5S_SIMPLE           <NA>
## 3          H5S_SIMPLE           <NA>

As a last step, we want to close the file. For this, we have 2 options, the close and close_all methods of an h5-file. There are some differences between the two that are not obvious to novice users. close will close the file, but groups and datasets that are already open will stay open. Furthermore, as long as any object is still open, the file cannot be re-opened in the regular fashion, as HDF5 prevents a file from being opened more than once.
However, it can be quite cumbersome to close all objects associated with a file, if we even still have access to them. We may have created an object and discarded it, but the garbage collector hasn’t closed it yet.
In order to make this process simpler for the end user, close_all closes the file as well as all objects associated with the file. Any R6-classes pointing to these objects will automatically be invalidated. This way, if it is needed, the file can be re-opened again.
As a rule, it is recommended to work in the following fashion. Open a file with H5File$new and store the resulting R6-class object. Do not discard this object. The current default behavior is to close the file, but not the objects inside the file, if the garbage collector is triggered. This is done in order not to interfere with other open objects later, but as explained it can prevent the re-opening of the file later. Therefore, do not discard the R6-class pointing to a file, and close it later using the close_all method in order to ensure that all IDs using the file are closed as well.
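The recommended workflow can be sketched as follows: keep the file object, close everything with close_all, and only then re-open the file. The names demo_name, demo_file, demo_ds and demo_reopened are hypothetical.

```r
library(hdf5r)

demo_name <- tempfile(fileext = ".h5")
demo_file <- H5File$new(demo_name, mode = "w")
demo_file[["x"]] <- 1:10
demo_ds <- demo_file[["x"]]  # a dataset handle that is still open

# Closes the file and every object that still points into it
demo_file$close_all()

# The file can now be opened again, here read-only
demo_reopened <- H5File$new(demo_name, mode = "r")
vals <- demo_reopened[["x"]][]
demo_reopened$close_all()
```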
HDF5 provides a very wide range of tools. Describing all of it here would certainly be a task that is too large for this vignette. For a complete overview of what HDF5 can do, the reader should have a look at the HDF5 website and the documentation that is listed there, specifically the reference manual. Most API functions that are referenced there are already implemented (and any other missing functionality that is feasible will hopefully follow soon).
In this section we will therefore only shine a spotlight on a number of low-level API functions that can be used in connection with creating datasets as well as datatypes.
As we have already seen above, a dataset can be created by simply assigning an appropriate R object under a given name into a group or a file. The automatic algorithm then uses the size of the assigned object to determine the size of the HDF5 dataset. It also makes assumptions about “chunking” that have an influence on the storage efficiency as well as the maximum possible size of the dataset.
However, we have much more control if we specify these things “by hand”. In the following example, we will create a dataset consisting of 2-bit unsigned integers (i.e. capable of storing values from 0 to 3). We will set the size of the dataset as well as the space and the chunk size ourselves. As a first step, let’s create the custom datatype
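One way to build such a type is to start from a built-in unsigned integer and narrow it down. This is a sketch under the assumption that the integer datatype class offers the setters set_size, set_precision and set_sign; the name uint2_dt matches the code further below.

```r
library(hdf5r)

# Start from a built-in unsigned 32-bit integer, shrink it to 1 byte,
# then restrict the precision to 2 bits and mark it as unsigned
uint2_dt <- h5types$H5T_NATIVE_UINT32$set_size(1)$set_precision(2)$set_sign(h5const$H5T_SGN_NONE)

uint2_dt$get_size()       # storage size in bytes
uint2_dt$get_precision()  # number of significant bits
```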
Here we use a built-in constant and datatype. All constants can be accessed using h5const$<const_name> and all built-in types are accessed with h5types$<type_name>.
Next we define the space that we will use for the dataset, where we want 10 columns and 10 rows. The number of columns will always be fixed, but the number of rows should be able to increase to infinity.
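A minimal sketch of such a space, using the H5S class with a fixed second dimension and an unlimited first one; the name space_ds matches the code further below.

```r
library(hdf5r)

# 10 rows x 10 columns now; rows may grow without bound, columns stay at 10
space_ds <- H5S$new(dims = c(10, 10), maxdims = c(Inf, 10))

space_ds$dims
space_ds$maxdims
```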
Next, we have to define with which properties the dataset should be created. We will set a default fill value of 1, enable n-bit filtering but no compression, and set the chunk size to (10, 10).
ds_create_pl_nbit <- H5P_DATASET_CREATE$new()
ds_create_pl_nbit$set_chunk(c(10, 10))$set_fill_value(uint2_dt, 1)$set_nbit()

Now let’s put all this together and create a dataset.
uint2.grp <- file.h5$create_group("uint2")
uint2_ds_nbit <- uint2.grp$create_dataset(name = "nbit_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_nbit,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_nbit[, ] <- sample(0:3, size = 100, replace = TRUE)
uint2_ds_nbit$get_storage_size()
## [1] 26

And now let’s compare what happens if we don’t have any filter, only compression, and nbit as well as compression
ds_create_pl_nbit_deflate <- ds_create_pl_nbit$copy()$set_deflate(9)
ds_create_pl_deflate <- ds_create_pl_nbit$copy()$remove_filter()$set_deflate(9)
ds_create_pl_none <- ds_create_pl_nbit$copy()$remove_filter()
uint2_ds_nbit_deflate <- uint2.grp$create_dataset(name = "nbit_deflate_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_nbit_deflate,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_nbit_deflate[, ] <- uint2_ds_nbit[, ]
uint2_ds_deflate <- uint2.grp$create_dataset(name = "deflate_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_deflate,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_deflate[, ] <- uint2_ds_nbit[, ]
uint2_ds_none <- uint2.grp$create_dataset(name = "none_filter", space = space_ds,
    dtype = uint2_dt, dataset_create_pl = ds_create_pl_none,
    chunk_dim = NULL, gzip_level = NULL)
uint2_ds_none[, ] <- uint2_ds_nbit[, ]

With the sizes of the datasets
uint2_ds_nbit_deflate$get_storage_size()
uint2_ds_nbit$get_storage_size()
uint2_ds_deflate$get_storage_size()
uint2_ds_none$get_storage_size()
## [1] 35
## [1] 26
## [1] 55
## [1] 100

and we see that in the case of random data, not surprisingly, the nbit filter alone is the most efficient. Using compression on top of the nbit filter actually increases the storage size. However, despite the random data, compression can still save some space compared to raw storage, as in raw storage mode a whole byte is stored and not just 2 bits.
For integer datatypes we have already seen that we have control over essentially everything, i.e. signed/unsigned as well as precision down to the exact number of bits. For floats we have similar control, being able to customize the size of the mantissa as well as the exponent (although in practice this is likely less relevant than being able to customize integer types). To learn more about this functionality for floats, we recommend reading the relevant section of the manual.
HDF5 itself provides access to both C-type strings and FORTRAN-type strings. As R internally uses C-strings, only C-type strings are supported (i.e. strings that are NULL delimited). In terms of the size of the strings, there are fixed and variable length strings available.
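Both kinds of string type can be created with the H5T_STRING class. A minimal sketch, assuming that an infinite size requests a variable-length string; the name str_fixed_len is the one used in the compound example further below, str_var_len is chosen here for illustration.

```r
library(hdf5r)

str_fixed_len <- H5T_STRING$new(size = 20)  # room for 20 bytes per string
str_var_len <- H5T_STRING$new(size = Inf)   # variable-length string

str_fixed_len$get_size()
```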
These two types of strings have implications for efficiency and usability. For obvious reasons, variable length strings are more convenient, as they are never too small to hold a piece of information. However, internally in HDF5, these aren’t stored in the dataset itself; only a pointer to the HDF5-internal heap is stored. This has 2 implications:
From this perspective, fixed length strings are considerably better, as they are both faster (if not too long) and compressible. However, the user has to be careful that their strings aren’t getting too long, or they will be truncated.
The equivalent to factors in R are ENUM datatypes. These are stored internally as integers, but each integer has a string label attached to it. In contrast to R factor variables, the integer values do not have to start at 1 and do not have to be consecutive either. In order to support this more flexible datatype optimally on the R side as well, hdf5r comes with the factor_extended class. In the HDF5 API, each enum level is inserted one at a time. As this is rather inconvenient for a vector-oriented language like R, this functionality has not been exposed. We instead provide an R6-class constructor that lets us set all labels and values in one go.
For efficiency reasons, an integer datatype is automatically generated that provides exactly the needed precision in order to store the values of the enum. Given an enum variable, we can also find out what labels and values it has
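A minimal sketch of the constructor and the two accessors. The labels and values are chosen to match the output shown next; the variable name enum_example is an assumption.

```r
library(hdf5r)

# All labels and their (non-consecutive) integer values are set in one go
enum_example <- H5T_ENUM$new(labels = c("Label 1", "Label 2", "Label 3"),
                             values = c(-3, 5, 10))

enum_example$get_labels()
enum_example$get_values()
```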
## [1] "Label 1" "Label 2" "Label 3"
## [1] -3  5 10

In addition, we can also get the datatype back that the enum is based on
## Class: H5T_INTEGER
## Datatype: undefined integer

A logical variable is a special case of an enum. It is internally based on a 1-byte unsigned integer that has a precision of 1 bit (so an n-bit filter will only store a single bit). Its internal values are 0 and 1 with labels FALSE and TRUE respectively. As a class, it is represented as an H5T_ENUM
logical_example <- H5T_LOGICAL$new(include_NA = TRUE)
## we could also use h5types$H5T_LOGICAL or h5types$H5T_LOGICAL_NA
logical_example$get_labels()
logical_example$get_values()
## [1] "FALSE" "TRUE"  "NA"
## [1] 0 1 2

Note that doLogical has precedence over the labels parameter.
Tables are represented as COMPOUND HDF5 objects, which are the equivalent of a C-struct. As R does not know this datatype natively, it has to be converted from structs to the list-based construct of R data frames. Similar to ENUMs, we don’t expose the underlying C-API that builds the compound one element at a time, but instead provide constructors that create it in one go.
cpd_example <- H5T_COMPOUND$new(c("Double_col", "Int_col", "Logical_col"),
    dtypes = list(h5types$H5T_NATIVE_DOUBLE, h5types$H5T_NATIVE_INT, logical_example))

and similar to enums, we can also get back the column names, the classes of the datatypes as well as identifiers for the datatypes themselves.
## [1] "Double_col"  "Int_col"     "Logical_col"
## [1] H5T_FLOAT   H5T_INTEGER H5T_ENUM
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11
## [[1]]
## Class: H5T_FLOAT
## Datatype: H5T_IEEE_F64LE
##
## [[2]]
## Class: H5T_INTEGER
## Datatype: H5T_STD_I32LE
##
## [[3]]
## Class: H5T_LOGICAL
## Datatype: H5T_ENUM {
##    undefined integer;
##    "FALSE" 0;
##    "TRUE" 1;
##    "NA" 2;
## }

A textual description is also available
## H5T_COMPOUND {
##    H5T_IEEE_F64LE "Double_col" : 0;
##    H5T_STD_I32LE "Int_col" : 8;
##    H5T_ENUM {
##       undefined integer;
##       "FALSE" 0;
##       "TRUE" 1;
##       "NA" 2;
##    } "Logical_col" : 12;
## }

We also have a way of representing complex variables. These are a compound object consisting of two double precision floating point columns. This also matches nicely the fact that internally in R, complex values are represented as a struct of doubles.
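A minimal sketch of the complex type, using the H5T_COMPLEX constructor and the compound accessors get_cpd_labels and get_cpd_classes; the name cplx_example is the one used in the compound example further below.

```r
library(hdf5r)

cplx_example <- H5T_COMPLEX$new()

cplx_example$get_cpd_labels()   # names of the two member columns
cplx_example$get_cpd_classes()  # datatype class of each member
```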
## [1] "Real"      "Imaginary"
## [1] H5T_FLOAT H5T_FLOAT
## 13 Levels: H5T_NO_CLASS H5T_INTEGER H5T_FLOAT H5T_TIME ... H5T_NCLASSES
## 13 Values: -1 0 1 2 ... 11

A special datatype is the H5T_ARRAY. As datasets are themselves arrays, it is not needed to represent arrays as such. Rather, it is useful in cases where one datatype is wrapped inside another, so mainly if a column of a compound object is supposed to be an array. So let’s create an array and put it into a compound object together with some other columns
array_example <- H5T_ARRAY$new(dims = c(3, 4), dtype_base = h5types$H5T_NATIVE_INT)
cpd_several <- H5T_COMPOUND$new(c("STRING_fixed", "Double", "Complex", "Array"),
    dtypes = list(str_fixed_len, h5types$H5T_NATIVE_DOUBLE, cplx_example, array_example))
cat(cpd_several$to_text())
## H5T_COMPOUND {
##    H5T_STRING {
##       STRSIZE 20;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    } "STRING_fixed" : 0;
##    H5T_IEEE_F64LE "Double" : 20;
##    H5T_COMPOUND {
##       H5T_IEEE_F64LE "Real" : 0;
##       H5T_IEEE_F64LE "Imaginary" : 8;
##    } "Complex" : 28;
##    H5T_ARRAY {
##       [4][3] H5T_STD_I32LE
##    } "Array" : 44;
## }

And to see what this would look like as an R object
## Warning in format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
## corrupt data frame: columns will be truncated or padded with NAs
##   STRING_fixed Double Complex Array
## 1                   0    0+0i     0
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0

And last, there are also variable length datatypes, corresponding to a list in R where each item of the list has the same datatype (a general R list, where each item can have a different type, cannot be represented in HDF5).
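A minimal sketch of a variable length type whose base type is a compound. To keep the snippet self-contained it wraps a small two-column compound; cpd_demo and vlen_example are hypothetical names.

```r
library(hdf5r)

# Hypothetical two-column table type used as the base type
cpd_demo <- H5T_COMPOUND$new(c("Double_col", "Int_col"),
    dtypes = list(h5types$H5T_NATIVE_DOUBLE, h5types$H5T_NATIVE_INT))

# Variable length type: each item holds any number of cpd_demo rows
vlen_example <- H5T_VLEN$new(dtype_base = cpd_demo)
```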
This would represent a list where each item is a table with an arbitrary number of rows.
In this section some of the details will be discussed that are likely only interesting for the technically inclined or for someone who would want to extend the package itself.
In this package, the C-API of HDF5 is being used. For the C-API, it is usually the programmer’s responsibility to manually close an HDF5-ID that is being used by calling the appropriate “close” function. If programs are not written very diligently, this can easily lead to memory leaks.
As users of R are used to objects being automatically garbage-collected, such a behavior could pose a significant problem in R. In order to avoid any issues, the closing of HDF5-IDs is therefore done automatically using the R garbage collection mechanism.
For every id that is created in the C-code and passed back to R, an R6-class object is created that is non-cloneable. During creation, the finalizer (see reg.finalizer) is set so that during garbage collection of the R6-class object or when shutting down R, the corresponding HDF5 resources are released.
In addition to this, all HDF5-IDs that are currently in use are tracked as well (in the obj_tracker environment; not exported). The reason for this separate tracking is so that on demand, all objects that are currently still open in a file can be closed. The special challenge here is on the one hand to track every R6 object that is in use in R, and at the same time not to interfere with the normal operation of the R garbage collection mechanism. To this end, we cannot just save the environment itself in the obj_tracker (note that in R, an environment-object is always a pointer to the environment, not the whole environment itself). If we stored a pointer to the environment itself, the R garbage collector would never delete the environment, as formally it would still be in use (in the obj_tracker). In order to prevent that, the following mechanism was implemented:
As mentioned, this was mainly implemented to allow for the closing of all IDs that are still open inside a file and to invalidate all existing R6-classes as well.
In this context, let us quickly also discuss the special way HDF5 handles files. In HDF5, in principle a file can only be opened once. This can lead to problems, as users in R are used to being able to open files as often as they like. Furthermore, it is possible in HDF5 to close the ID of a file without closing all objects in the file. Then, however, the file actually stays open until the last ID pointing into the file is closed, and it cannot be opened again before that.
Therefore, as already explained above (and as recommended by the HDF5 manual), do not discard or close files that still have open objects in them. It is preferable to keep the HDF5-file-id pointer around and close it (and all objects inside the file) when it is no longer needed, using the close_all method.
A special feature of this package is the far-reaching and flexible implementation of data-conversion routines between R and HDF5. Routines have been implemented for all datatypes, strings, data frames, arrays and variable length (HDF5-VLEN) objects. Some are relatively straightforward, others are more complicated. Here, numeric datatypes can be tricky due to the limited ability of R to represent certain datatypes, specifically long doubles or 64-bit integers.
For numeric datatypes, the situation is in certain circumstances a bit tricky. In general, R numerical objects are either represented as 64-bit floating point values (doubles) or 32-bit integers. R switches relatively transparently between these types as needed (for computations, integers are converted to doubles and conversely, array positions can be addressed by doubles). The main issue when working with HDF5 occurs as R has neither a 64-bit signed nor a 64-bit unsigned integer datatype (and also no long double). In order to work around this issue, the following conventions are being used
An overview of how the data conversion is being done can be seen here:
The underlying principle is that any internal conversion between R types is done by R (with the resulting handling of NAs and overflows), whereas any conversion between R types and non-R types is done by the HDF5 library (usually meaning that on overflow, truncation occurs).
In HDF5, strings can either be variable length or fixed length strings. In R, they are always variable length. Therefore, strings from R to HDF5 that are written into fixed-length fields will be truncated. Conversely, strings from HDF5 that are fixed length will only be returned to R up to the NULL character that ends strings in C.
The situation is a bit more tricky for table-like objects. In R, these are data frames, which internally are a list of vectors. In HDF5, a table is a compound object, which is equivalent to a C-struct, i.e. every row is represented together, whereas in R every column is represented together. Each of these approaches has certain advantages, but the challenge here is to translate between them.
This is done in the straightforward manner. When converting from R to HDF5, the columns of the tables are copied into the struct, whereas in the reverse direction, every struct is decomposed into the corresponding columns.
The data frame <-> compound conversion is also extensively used for HDF5-API functions that return structs as result (and therefore return data frames).
In HDF5, datasets themselves can have arbitrary dimensions. In addition to that, there are also array datatypes that allow for the inclusion of arrays e.g. inside a compound object. Translation to and from arrays is relatively straightforward and only involves setting the correct dim attribute in R.
In addition to that, however, there is a small complication. In R, the first dimension is the fastest changing dimension. In HDF5 (same as in C), however, the last dimension is the fastest changing one. For datasets, we work around this problem by always reversing the dimensions that are passed between R and HDF5, thereby making the distinction transparent. For arrays, this is a bit trickier. For example, let us assume that we have a dataset that is a one-dimensional vector of length 10, each element of which is an array-datatype of length 4, resulting in a 10 x 4 dataset. It is now not quite clear how this should be represented in R. If we follow the notion that the fastest changing dimension in R is the first one, the result would be a dataset with 4 rows and 10 columns, i.e. 4 x 10.
This does feel rather unintuitive, forcing a user to specify the second dimension to get all items of the array. Therefore, we have implemented it so that a 10 x 4 dataset is returned, with each row corresponding to the array-datatype. In order to achieve this we have to deviate from the ordering principle in HDF5. Where in HDF5 the elements of the first internal array are in positions 1, 2, 3 and 4 (or 0 to 3 when you start counting at 0), in R they are now in positions 1, 11, 21, and 31. In order to do this, we first internally read the HDF5 array into an R array of shape 4 x 10 and then transpose the result.
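The reordering can be illustrated in base R alone, without touching HDF5. In this sketch, hdf5_buffer is a hypothetical stand-in for the flat buffer as HDF5 lays it out (the 4 values of the first array element first, then the 4 values of the second, and so on).

```r
n_elem <- 10   # length of the one-dimensional dataset
arr_len <- 4   # length of the array datatype of each element

# Stand-in for the flat HDF5 buffer: values 1..4 belong to element 1, etc.
hdf5_buffer <- seq_len(n_elem * arr_len)

# Step 1: read into a 4 x 10 R array (first dimension changes fastest)
raw_shape <- array(hdf5_buffer, dim = c(arr_len, n_elem))

# Step 2: transpose, so that each row holds one array element
result <- t(raw_shape)

dim(result)
result[1, ]
# Positions of the first element's values in R's internal ordering
first_positions <- which(as.vector(result) %in% 1:4)
first_positions
```

After the transpose, indexing a single row returns one complete array element, at the price of the values no longer being contiguous in R's memory layout.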
In HDF5, there are also variable-length data types. Essentially, this corresponds to an R list-like object, with the additional restriction that every item of the list has to be of the same datatype. This is also how it is implemented. R lists where all items are vectors (of arbitrary length) of the same type can be converted to HDF5-VLEN objects and vice versa.
As of the writing of this vignette, these have not yet been implemented.