This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured.
Arrow metadata classes
The arrow package defines the following classes for representing metadata:
- A Schema is a list of Field objects used to describe the structure of a tabular data object; where
- A Field specifies a character string name and a DataType; and
- A DataType is an attribute controlling how values are represented
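The relationship between these three classes can be explored directly. The following is a minimal sketch, assuming the arrow package is installed; the object names f and s are illustrative only and do not appear elsewhere in this article.

```r
library(arrow)

# A Field pairs a character string name with a DataType
f <- field("x", int32())
f$name
f$type

# A Schema is an ordered collection of Fields
s <- schema(f, field("y", utf8()))
s$names
s$fields[[2]]$type
```

Inspecting a Schema this way can be a quick check that each column has the type you expect before any data is written.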
Consider this:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)
tb$schema
## Schema
## x: int32
## y: string
##
## See $metadata for additional Schema metadata

The schema that has been automatically inferred could also be manually created:
schema(
  field(name = "x", type = int32()),
  field(name = "y", type = utf8())
)
## Schema
## x: int32
## y: string

The schema() function allows the following shorthand to define fields:
schema(x = int32(), y = utf8())
## Schema
## x: int32
## y: string

Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
arrow_table(df, schema = schema(x = int64(), y = utf8()))
## Table
## 3 rows x 2 columns
## $x <int64>
## $y <string>
##
## See $metadata for additional Schema metadata

arrow_table(df, schema = schema(x = float64(), y = utf8()))
## Table
## 3 rows x 2 columns
## $x <double>
## $y <string>
##
## See $metadata for additional Schema metadata

R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a data.frame to an Arrow Table or RecordBatch, the package stores any attributes() attached to the columns of the data.frame in the Arrow object Schema. Attributes added to objects in this fashion are stored under the r key, as shown below:
# data frame with custom metadata
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"

# when converted to a Table, the metadata is preserved
tb <- arrow_table(df)
tb$metadata
## $r
## $r$attributes
## $r$attributes$df_meta
## [1] "custom data frame metadata"
##
##
## $r$columns
## $r$columns$x
## NULL
##
## $r$columns$y
## $r$columns$y$attributes
## $r$columns$y$attributes$col_meta
## [1] "custom column metadata"
##
##
## $r$columns$y$columns
## NULL

It is also possible to assign additional string metadata under any other key you wish, using a command like this:
tb$metadata$new_key <- "new value"

Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling as.data.frame() on a Table or RecordBatch, the column attributes are restored to the columns of the resulting data.frame. This means that custom data types, including haven::labelled, vctrs annotations, and others, are preserved when doing a round-trip through Arrow.
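The round-trip behaviour can be checked with a short experiment. This is a sketch under stated assumptions: the arrow package is installed, a temporary file is used for the Feather output, and the names df_memory, df_file, and path are illustrative only.

```r
library(arrow)

# data frame with a custom column attribute
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df$y, "col_meta") <- "custom column metadata"

# in-memory round trip: data.frame -> Table -> data.frame
df_memory <- as.data.frame(arrow_table(df))
attr(df_memory$y, "col_meta")

# on-disk round trip through a temporary Feather file
path <- tempfile(fileext = ".feather")
write_feather(df, path)
df_file <- read_feather(path)
attr(df_file$y, "col_meta")

# remove the temporary file
unlink(path)
```

The example uses a plain character attribute. Attributes that hold more complex R objects may be handled differently depending on the installed arrow version, so it is worth testing with your own data.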
Note that the attributes stored in $metadata$r are only understood by R. If you write a data.frame with haven columns to a Feather file and read that in Pandas, the haven metadata won’t be recognized there. Similarly, Pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
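One possible application-level convention is sketched below. The key names app_version and key_columns are hypothetical, chosen only for illustration, and the comma-separated encoding is an assumption of this sketch rather than anything defined by arrow. Because values under custom keys are strings, anything more structured has to be encoded into a string and decoded again by the reader.

```r
library(arrow)

tb <- arrow_table(data.frame(x = 1:3, y = c("a", "b", "c")))

# hypothetical application keys; each value must be a single string
tb$metadata$app_version <- "1.2.0"
tb$metadata$key_columns <- paste(c("x", "y"), collapse = ",")

# a consumer that knows the convention decodes the string again
app_version <- tb$metadata$app_version
key_columns <- strsplit(tb$metadata$key_columns, ",", fixed = TRUE)[[1]]
```

Since these keys sit outside $metadata$r, consumers in other languages that follow the same convention should be able to read them as ordinary string key-value pairs.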
Further reading
- To learn more about arrow metadata, see the documentation for schema().
- To learn more about data types, see the data types article.