- Notifications
You must be signed in to change notification settings - Fork2
Reshape disorganised messy data
luckinet/tabshiftr
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
Data are stored in many different ways in tables or spreadsheets becauseno strict semantic or topographic standards for the organisation oftables are commonly accepted. In the R environment thetidy paradigmis a first step towards interoperability of data, in that it requires acertain arrangement of tables, where variables are recorded in columnsand observations in rows (seehttps://tidyr.tidyverse.org/). Tablescan be tidied (i.e., brought into a tidy arrangement) via packages suchastidyr, however, all functions that deal with reshaping tables todate require data that are already organised into topologicallycoherent, rectangular tables. This is often violated in practice,especially in data that are scraped off of the internet.
tabshiftr fills this gap in the toolchain towards more interoperabledata viaschema descriptions that are built with setters and debuggedwith getters and areorganise() function that ties everythingtogether.
- Install the official version from CRAN:
install.packages("tabshiftr")or the latest development version from github:
devtools::install_github("luckinet/tabshiftr")
- Thevignette
- gives an introduction of the nature of tabular data and of thedimensions of disorganization dealt with here,
- provides an instruction on how to set up a schema description
- shows a wide range of table arrangements that can be reshaped with thetools provided here.
A disorganized table may look like the following table:
library(tabshiftr)library(knitr)# a rather disorganized table with messy clusters and a distinct variableinput<-tabs2shift$clusters_messykable(input)
| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
|---|---|---|---|---|---|---|
| commodities | harvested | production | . | . | . | . |
| unit 1 | . | . | . | . | . | . |
| soybean | 1111 | 1112 | year 1 | . | . | . |
| maize | 1121 | 1122 | year 1 | . | . | . |
| soybean | 1211 | 1212 | year 2 | . | . | . |
| maize | 1221 | 1222 | year 2 | . | . | . |
| . | . | . | . | . | . | . |
| commodities | harvested | production | commodities | harvested | production | . |
| unit 2 | . | . | unit 3 | . | . | . |
| soybean | 2111 | 2112 | soybean | 3111 | 3112 | year 1 |
| maize | 2121 | 2122 | maize | 3121 | 3122 | year 1 |
| soybean | 2211 | 2212 | soybean | 3211 | 3212 | year 2 |
| maize | 2221 | 2222 | maize | 3221 | 3222 | year 2 |
If we were to transform this data into tidy data by merely using thefunctions intidyr (or the extendedtidyverse in general), we’dpotentially end up with a massive algorithm, especially for suchcomplicated table arrangements. For other tables that may or may not beas complicated, we’d have to set up yet more algorithms and while apipeline of tidy functions is relatively easy to set up, it would stillbecome very laborious to repeat this for the dozens of potential tablearrangements. Intabshiftr we solve that by describing the schema ofthe input table and providing this schema description to thereorganise() function. This requires us to use a vastly smaller set ofcode and makes it thus a lot more efficient to bring multipleheterogeneous data into an interoperable format.
# put together schema description by ...# ... identifying cluster positionsschema<- setCluster(id="territories",left= c(1,1,4),top= c(1,8,8))# ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)schema<-schema %>% setIDVar(name="territories",columns= c(1,1,4),rows= c(2,9,9))# ... specifying a distinct variable (explicit position)schema<-schema %>% setIDVar(name="year",columns=4,rows= c(3:6),distinct=TRUE)# ... specifying a tidy identifying variable (by giving the column values)schema<-schema %>% setIDVar(name="commodities",columns= c(1,1,4))# ... identifying the (tidy) observed variablesschema<-schema %>% setObsVar(name="harvested",columns= c(2,2,5)) %>% setObsVar(name="production",columns= c(3,3,6))# to potentially debug the schema description, first validate the schema ...schema_valid<- validateSchema(schema=schema,input=input)# ... and extract parts of it per cluster (also check out the other getters in# this package)getIDVars(schema=schema_valid,input=input)#> [[1]]#> [[1]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[1]]$commodities#> # A tibble: 4 × 1#> X1#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[2]]#> [[2]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[2]]$commodities#> # A tibble: 4 × 1#> X1#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[3]]#> [[3]]$year#> # A tibble: 4 × 1#> X4#> <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[3]]$commodities#> # A tibble: 4 × 1#> X4#> <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maizegetObsVars(schema=schema_valid,input=input)#> [[1]]#> [[1]]$harvested#> # A tibble: 4 × 1#> X2#> <chr>#> 1 1111#> 2 1121#> 3 1211#> 4 1221#>#> [[1]]$production#> # A tibble: 4 × 1#> X3#> <chr>#> 1 1112#> 2 1122#> 3 1212#> 4 1222#>#>#> [[2]]#> [[2]]$harvested#> # A tibble: 4 × 1#> X2#> <chr>#> 1 2111#> 2 2121#> 3 2211#> 4 2221#>#> [[2]]$production#> # A tibble: 4 × 1#> X3#> <chr>#> 1 2112#> 2 2122#> 3 2212#> 4 2222#>#>#> [[3]]#> [[3]]$harvested#> # A tibble: 4 × 1#> X5#> <chr>#> 1 3111#> 2 3121#> 3 3211#> 4 3221#>#> [[3]]$production#> # A tibble: 4 × 1#> X6#> <chr>#> 1 3112#> 2 3122#> 3 3212#> 4 3222# alternatively, if the clusters are regular, relative values starting from the# cluster origin could be setschema_alt<- setCluster(id="territories",left= c(1,1,4),top= c(1,8,8)) %>% setIDVar(name="territories",columns=1,rows= .find(row=2,relative=TRUE)) %>% setIDVar(name="year",columns=4,rows= c(3:6),distinct=TRUE) %>% setIDVar(name="commodities",columns= .find(col=1,relative=TRUE)) %>% setObsVar(name="harvested",columns= .find(col=2,relative=TRUE)) %>% setObsVar(name="production",columns= .find(col=3,relative=TRUE))
Thereorganise() function carries out the steps of validating,extracting the variables, pivoting the tentative output and putting thefinal table together automatically, so it merely requires the finalisedschema and theinput table.
schema# has a pretty print function#> 3 clusters#> origin : 1|1, 8|1, 8|4 (row|col)#> id : territories#>#> variable type row col dist#> ------------- ---------- ------ ------ ------#> territories id 2, 9 1, 4 F#> year id 3:6 4 T#> commodities id 1, 4 F#> harvested observed 2, 5 F#> production observed 3, 6 Foutput<- reorganise(input=input,schema=schema)kable(output)
| territories | year | commodities | harvested | production |
|---|---|---|---|---|
| unit 1 | year 1 | soybean | 1111 | 1112 |
| unit 1 | year 1 | maize | 1121 | 1122 |
| unit 1 | year 2 | soybean | 1211 | 1212 |
| unit 1 | year 2 | maize | 1221 | 1222 |
| unit 2 | year 1 | soybean | 2111 | 2112 |
| unit 2 | year 1 | maize | 2121 | 2122 |
| unit 2 | year 2 | soybean | 2211 | 2212 |
| unit 2 | year 2 | maize | 2221 | 2222 |
| unit 3 | year 1 | soybean | 3111 | 3112 |
| unit 3 | year 1 | maize | 3121 | 3122 |
| unit 3 | year 2 | soybean | 3211 | 3212 |
| unit 3 | year 2 | maize | 3221 | 3222 |
- tabshiftr is still in development. So far it reliably reorganizes 20different types of tables, but additional dimensions ofdisorganization might show themselves. If you encounter a table thatcan’t be reorganized with the current infrastructure, we’d be morethan happy to collaborate on advancing
tabshiftr. - Informative error management is work in process.
- Moreover, the resulting schema descriptions can be useful for dataarchiving or database building and
tabshiftrshould at some pointsupport that those schemas can be exported into data-formats that areused by downstream applications (xml, json, …), following proper (ISO)standards. In case you have experience with those standards and wouldlike to collaborate on it, please get in touch!
This work was supported by funding to Carsten Meyer through the Flexpoolmechanism of the German Centre for Integrative Biodiversity Research(iDiv) (FZT-118, DFG).
About
Reshape disorganised messy data
Topics
Resources
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors5
Uh oh!
There was an error while loading.Please reload this page.