Data are stored in many different ways in tables or spreadsheetsbecause no strict semantic or topographic standards for the organisationof tables are commonly accepted. In the R environment thetidyparadigm is a first step towards interoperability of data, in that itrequires a certain arrangement of tables, where variables are recordedin columns and observations in rows (seehttps://tidyr.tidyverse.org/). Tables can be tidied(i.e., brought into a tidy arrangement) via packages such astidyr, however, all functions that deal with reshapingtables to date require data that are already organised intotopologically coherent, rectangular tables. This is often violated inpractice, especially in data that are scraped off of the internet.
tabshiftr fills this gap in the toolchain towards moreinteroperable data viaschema descriptions that are builtwith setters and debugged with getters and areorganise()function that ties everything together.
install.packages("tabshiftr")or the latest development version from github:
devtools::install_github("EhrmannS/tabshiftr")A disorganised table may look like the following table:
library(tabshiftr)library(knitr)# a rather disorganised table with messy clusters and a distinct variableinput<- tabs2shift$clusters_messykable(input)| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|---|---|---|---|---|---|
| commodities | harvested | production | . | . | . | . |
| unit 1 | . | . | . | . | . | . |
| soybean | 1111 | 1112 | year 1 | . | . | . |
| maize | 1121 | 1122 | year 1 | . | . | . |
| soybean | 1211 | 1212 | year 2 | . | . | . |
| maize | 1221 | 1222 | year 2 | . | . | . |
| . | . | . | . | . | . | . |
| commodities | harvested | production | commodities | harvested | production | . |
| unit 2 | . | . | unit 3 | . | . | . |
| soybean | 2111 | 2112 | soybean | 3111 | 3112 | year 1 |
| maize | 2121 | 2122 | maize | 3121 | 3122 | year 1 |
| soybean | 2211 | 2212 | soybean | 3211 | 3212 | year 2 |
| maize | 2221 | 2222 | maize | 3221 | 3222 | year 2 |
If we were to transform this data into tidy data by merely using thefunctions intidyr (or the extendedtidyversein general), we’d potentially end up with a massive algorithm,especially for such complicated table arrangements. For other tablesthat may or may not be as complicated, we’d have to set up yet morealgorithms and while a pipeline of tidy functions is relatively easy toset up, it would still become very laborious to repeat this for thedozens of potential table arrangements. Intabshiftr wesolve that by describing the schema of the input table and providingthis schema description to thereorganise() function. Thisrequires us to use a vastly smaller set of code and makes it thus a lotmore efficient to bring multiple heterogeneous data into aninteroperable format.
# put together schema description by ...# ... identifying cluster positionsschema<-setCluster(id ="territories",left =c(1,1,4),top =c(1,8,8))# ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)schema<- schema%>%setIDVar(name ="territories",columns =c(1,1,4),rows =c(2,9,9))# ... specifying a distinct variable (explicit position)schema<- schema%>%setIDVar(name ="year",columns =4,rows =c(3:6),distinct =TRUE)# ... specifying a tidy variable (by giving the column values)schema<- schema%>%setIDVar(name ="commodities",columns =c(1,1,4))# ... identifying the (tidy) observed variablesschema<- schema%>%setObsVar(name ="harvested",columns =c(2,2,5))%>%setObsVar(name ="production",columns =c(3,3,6))# to potentially debug the schema description, first validate the schema ...schema_valid<-validateSchema(schema = schema,input = input)# ... and extract parts of it per cluster (also check out the other getters in# this package)getIDVars(schema = schema_valid,input = input)#> [[1]]#> [[1]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[1]]$commodities#> # A tibble: 4 × 1#> X1#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[2]]#> [[2]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[2]]$commodities#> # A tibble: 4 × 1#> X1#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[3]]#> [[3]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[3]]$commodities#> # A tibble: 4 × 1#> X4#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maizegetObsVars(schema = schema_valid,input = input)#> [[1]]#> [[1]]$harvested#> # A tibble: 4 × 1#> X2#> <chr>#> 1 1111#> 2 1121#> 3 1211#> 4 1221#>#> [[1]]$production#> # A tibble: 4 × 1#> X3#> <chr>#> 1 1112#> 2 1122#> 3 1212#> 4 1222#>#>#> [[2]]#> [[2]]$harvested#> # A tibble: 4 × 1#> X2#> <chr>#> 1 2111#> 2 2121#> 3 2211#> 4 2221#>#> [[2]]$production#> # A tibble: 4 × 1#> X3#> <chr>#> 1 2112#> 2 2122#> 3 2212#> 4 2222#>#>#> [[3]]#> [[3]]$harvested#> # A tibble: 4 × 1#> X5#> <chr>#> 1 3111#> 2 3121#> 3 3211#> 4 3221#>#> [[3]]$production#> # A tibble: 4 × 1#> X6#> <chr>#> 1 3112#> 2 3122#> 3 3212#> 4 3222# alternatively, if the clusters are regular, relative values starting from the# cluster origin could be setschema_alt<-setCluster(id ="territories",left =c(1,1,4),top =c(1,8,8))%>%setIDVar(name ="territories",columns =1,rows =.find(row =2,relative =TRUE))%>%setIDVar(name ="year",columns =4,rows =c(3:6),distinct =TRUE)%>%setIDVar(name ="commodities",columns =.find(col =1,relative =TRUE))%>%setObsVar(name ="harvested",columns =.find(col =2,relative =TRUE))%>%setObsVar(name ="production",columns =.find(col =3,relative =TRUE))Thereorganise() function carries out the steps ofvalidating, extracting the variables, pivoting the tentative output andputting the final table together automatically, so it merely requiresthe finalisedschema and theinput table.
schema# has a pretty print function#> 3 clusters#> origin : 1|1, 8|1, 8|4 (row|col)#> id : territories#>#> variable type row col dist#> ------------- ---------- ------ ------ ------#> territories id 2, 9 1, 4 F#> year id 3:6 4 T#> commodities id 1, 4 F#> harvested observed 2, 5 F#> production observed 3, 6 Foutput<-reorganise(input = input,schema = schema)kable(output)| territories | year | commodities | harvested | production |
|---|---|---|---|---|
| unit 1 | year 1 | maize | 1121 | 1122 |
| unit 1 | year 1 | soybean | 1111 | 1112 |
| unit 1 | year 2 | maize | 1221 | 1222 |
| unit 1 | year 2 | soybean | 1211 | 1212 |
| unit 2 | year 1 | maize | 2121 | 2122 |
| unit 2 | year 1 | soybean | 2111 | 2112 |
| unit 2 | year 2 | maize | 2221 | 2222 |
| unit 2 | year 2 | soybean | 2211 | 2212 |
| unit 3 | year 1 | maize | 3121 | 3122 |
| unit 3 | year 1 | soybean | 3111 | 3112 |
| unit 3 | year 2 | maize | 3221 | 3222 |
| unit 3 | year 2 | soybean | 3211 | 3212 |
tabshiftr.tabshiftr should at somepoint support that those schemas can be exported into data-formats thatare used by downstream applications (xml, json, …), following proper(ISO) standards. In case you have experience with those standards andwould like to collaborate on it, please get in touch!This work was supported by funding to Carsten Meyer through theFlexpool mechanism of the German Centre for Integrative BiodiversityResearch (iDiv) (FZT-118, DFG).