Reshape disorganised messy data

luckinet/tabshiftr

Overview

Data are stored in many different ways in tables or spreadsheets because no strict semantic or topographic standards for the organisation of tables are commonly accepted. In the R environment, the tidy paradigm is a first step towards interoperability of data, in that it requires a certain arrangement of tables, where variables are recorded in columns and observations in rows (see https://tidyr.tidyverse.org/). Tables can be tidied (i.e., brought into a tidy arrangement) via packages such as tidyr; however, all functions that deal with reshaping tables to date require data that are already organised into topologically coherent, rectangular tables. This requirement is often violated in practice, especially in data that are scraped off the internet.

tabshiftr fills this gap in the toolchain towards more interoperable data via schema descriptions that are built with setters and debugged with getters, and a reorganise() function that ties everything together.

Installation

  1. Install the official version from CRAN:

     install.packages("tabshiftr")

     or the latest development version from GitHub:

     devtools::install_github("luckinet/tabshiftr")

  2. The vignette
  • gives an introduction to the nature of tabular data and to the dimensions of disorganization dealt with here,
  • provides instructions on how to set up a schema description,
  • shows a wide range of table arrangements that can be reshaped with the tools provided here.

Examples

A disorganized table may look like the following table:

    library(tabshiftr)
    library(knitr)

    # a rather disorganized table with messy clusters and a distinct variable
    input <- tabs2shift$clusters_messy
    kable(input)

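| X1          | X2        | X3         | X4          | X5        | X6         | X7     |
|-------------|-----------|------------|-------------|-----------|------------|--------|
| commodities | harvested | production | .           | .         | .          | .      |
| unit 1      | .         | .          | .           | .         | .          | .      |
| soybean     | 1111      | 1112       | year 1      | .         | .          | .      |
| maize       | 1121      | 1122       | year 1      | .         | .          | .      |
| soybean     | 1211      | 1212       | year 2      | .         | .          | .      |
| maize       | 1221      | 1222       | year 2      | .         | .          | .      |
| .           | .         | .          | .           | .         | .          | .      |
| commodities | harvested | production | commodities | harvested | production | .      |
| unit 2      | .         | .          | unit 3      | .         | .          | .      |
| soybean     | 2111      | 2112       | soybean     | 3111      | 3112       | year 1 |
| maize       | 2121      | 2122       | maize       | 3121      | 3122       | year 1 |
| soybean     | 2211      | 2212       | soybean     | 3211      | 3212       | year 2 |
| maize       | 2221      | 2222       | maize       | 3221      | 3222       | year 2 |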

If we were to transform this data into tidy data by merely using the functions in tidyr (or the extended tidyverse in general), we’d potentially end up with a massive algorithm, especially for such complicated table arrangements. For other tables that may or may not be as complicated, we’d have to set up yet more algorithms, and while a pipeline of tidy functions is relatively easy to set up, it would still become very laborious to repeat this for the dozens of potential table arrangements. In tabshiftr we solve that by describing the schema of the input table and providing this schema description to the reorganise() function. This requires far less code and makes it much more efficient to bring multiple heterogeneous datasets into an interoperable format.
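Before turning to the schema description, it may help to see what the hand-written route looks like for just one of the three clusters. The following is a minimal sketch in base R, not part of tabshiftr and not how the package works internally. The object `raw` is a hand-built copy of rows 1 to 6 of the example table, empty cells are written as `NA`, and the variable names `id_row` and `data_rows` as well as the conversion to numeric are assumptions made for illustration.

```r
# Hand-built copy of rows 1-6 (columns X1-X4) of the messy example table.
# Illustrative only; empty cells are written as NA here.
raw <- data.frame(
  X1 = c("commodities", "unit 1", "soybean", "maize", "soybean", "maize"),
  X2 = c("harvested", NA, "1111", "1121", "1211", "1221"),
  X3 = c("production", NA, "1112", "1122", "1212", "1222"),
  X4 = c(NA, NA, "year 1", "year 1", "year 2", "year 2"),
  stringsAsFactors = FALSE
)

# Hard-coded positions for this one cluster. Every other cluster, and every
# other table layout, would need its own set of indices.
id_row    <- 2
data_rows <- 3:6

# Assemble the tidy table for "unit 1" column by column.
# The single territory label is recycled over the four data rows.
tidy_unit1 <- data.frame(
  territories = raw[id_row, "X1"],
  year        = raw[data_rows, "X4"],
  commodities = raw[data_rows, "X1"],
  harvested   = as.numeric(raw[data_rows, "X2"]),  # numeric conversion is a choice made here
  production  = as.numeric(raw[data_rows, "X3"]),
  stringsAsFactors = FALSE
)

print(tidy_unit1)
```

Even this small sketch hard-codes six positions for a single cluster, which is the repetition that a schema description is meant to replace.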

    # put together schema description by ...
    # ... identifying cluster positions
    schema <- setCluster(id = "territories",
                         left = c(1, 1, 4), top = c(1, 8, 8))

    # ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)
    schema <- schema %>%
      setIDVar(name = "territories", columns = c(1, 1, 4), rows = c(2, 9, 9))

    # ... specifying a distinct variable (explicit position)
    schema <- schema %>%
      setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE)

    # ... specifying a tidy identifying variable (by giving the column values)
    schema <- schema %>%
      setIDVar(name = "commodities", columns = c(1, 1, 4))

    # ... identifying the (tidy) observed variables
    schema <- schema %>%
      setObsVar(name = "harvested", columns = c(2, 2, 5)) %>%
      setObsVar(name = "production", columns = c(3, 3, 6))

    # to potentially debug the schema description, first validate the schema ...
    schema_valid <- validateSchema(schema = schema, input = input)

    # ... and extract parts of it per cluster (also check out the other getters in
    # this package)
    getIDVars(schema = schema_valid, input = input)
    #> [[1]]
    #> [[1]]$year
    #> # A tibble: 4 × 1
    #>   X4
    #>   <chr>
    #> 1 year 1
    #> 2 year 1
    #> 3 year 2
    #> 4 year 2
    #>
    #> [[1]]$commodities
    #> # A tibble: 4 × 1
    #>   X1
    #>   <chr>
    #> 1 soybean
    #> 2 maize
    #> 3 soybean
    #> 4 maize
    #>
    #>
    #> [[2]]
    #> [[2]]$year
    #> # A tibble: 4 × 1
    #>   X4
    #>   <chr>
    #> 1 year 1
    #> 2 year 1
    #> 3 year 2
    #> 4 year 2
    #>
    #> [[2]]$commodities
    #> # A tibble: 4 × 1
    #>   X1
    #>   <chr>
    #> 1 soybean
    #> 2 maize
    #> 3 soybean
    #> 4 maize
    #>
    #>
    #> [[3]]
    #> [[3]]$year
    #> # A tibble: 4 × 1
    #>   X4
    #>   <chr>
    #> 1 year 1
    #> 2 year 1
    #> 3 year 2
    #> 4 year 2
    #>
    #> [[3]]$commodities
    #> # A tibble: 4 × 1
    #>   X4
    #>   <chr>
    #> 1 soybean
    #> 2 maize
    #> 3 soybean
    #> 4 maize

    getObsVars(schema = schema_valid, input = input)
    #> [[1]]
    #> [[1]]$harvested
    #> # A tibble: 4 × 1
    #>   X2
    #>   <chr>
    #> 1 1111
    #> 2 1121
    #> 3 1211
    #> 4 1221
    #>
    #> [[1]]$production
    #> # A tibble: 4 × 1
    #>   X3
    #>   <chr>
    #> 1 1112
    #> 2 1122
    #> 3 1212
    #> 4 1222
    #>
    #>
    #> [[2]]
    #> [[2]]$harvested
    #> # A tibble: 4 × 1
    #>   X2
    #>   <chr>
    #> 1 2111
    #> 2 2121
    #> 3 2211
    #> 4 2221
    #>
    #> [[2]]$production
    #> # A tibble: 4 × 1
    #>   X3
    #>   <chr>
    #> 1 2112
    #> 2 2122
    #> 3 2212
    #> 4 2222
    #>
    #>
    #> [[3]]
    #> [[3]]$harvested
    #> # A tibble: 4 × 1
    #>   X5
    #>   <chr>
    #> 1 3111
    #> 2 3121
    #> 3 3211
    #> 4 3221
    #>
    #> [[3]]$production
    #> # A tibble: 4 × 1
    #>   X6
    #>   <chr>
    #> 1 3112
    #> 2 3122
    #> 3 3212
    #> 4 3222

    # alternatively, if the clusters are regular, relative values starting from the
    # cluster origin could be set
    schema_alt <- setCluster(id = "territories",
                             left = c(1, 1, 4), top = c(1, 8, 8)) %>%
      setIDVar(name = "territories", columns = 1, rows = .find(row = 2, relative = TRUE)) %>%
      setIDVar(name = "year", columns = 4, rows = c(3:6), distinct = TRUE) %>%
      setIDVar(name = "commodities", columns = .find(col = 1, relative = TRUE)) %>%
      setObsVar(name = "harvested", columns = .find(col = 2, relative = TRUE)) %>%
      setObsVar(name = "production", columns = .find(col = 3, relative = TRUE))

The reorganise() function carries out the steps of validating, extracting the variables, pivoting the tentative output and putting the final table together automatically, so it merely requires the finalised schema and the input table.

    schema # has a pretty print function
    #>   3 clusters
    #>     origin : 1|1, 8|1, 8|4  (row|col)
    #>     id     : territories
    #>
    #>    variable      type       row    col    dist
    #>   ------------- ---------- ------ ------ ------
    #>    territories   id         2, 9   1, 4   F
    #>    year          id         3:6    4      T
    #>    commodities   id                1, 4   F
    #>    harvested     observed          2, 5   F
    #>    production    observed          3, 6   F

    output <- reorganise(input = input, schema = schema)
    kable(output)

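| territories | year   | commodities | harvested | production |
|-------------|--------|-------------|-----------|------------|
| unit 1      | year 1 | soybean     | 1111      | 1112       |
| unit 1      | year 1 | maize       | 1121      | 1122       |
| unit 1      | year 2 | soybean     | 1211      | 1212       |
| unit 1      | year 2 | maize       | 1221      | 1222       |
| unit 2      | year 1 | soybean     | 2111      | 2112       |
| unit 2      | year 1 | maize       | 2121      | 2122       |
| unit 2      | year 2 | soybean     | 2211      | 2212       |
| unit 2      | year 2 | maize       | 2221      | 2222       |
| unit 3      | year 1 | soybean     | 3111      | 3112       |
| unit 3      | year 1 | maize       | 3121      | 3122       |
| unit 3      | year 2 | soybean     | 3211      | 3212       |
| unit 3      | year 2 | maize       | 3221      | 3222       |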

Contributions

  • tabshiftr is still in development. So far it reliably reorganizes 20 different types of tables, but additional dimensions of disorganization might show themselves. If you encounter a table that can’t be reorganized with the current infrastructure, we’d be more than happy to collaborate on advancing tabshiftr.
  • Informative error management is a work in progress.
  • Moreover, the resulting schema descriptions can be useful for data archiving or database building, and tabshiftr should at some point support exporting those schemas into data formats that are used by downstream applications (xml, json, …), following proper (ISO) standards. If you have experience with those standards and would like to collaborate on this, please get in touch!

Acknowledgement

This work was supported by funding to Carsten Meyer through the Flexpool mechanism of the German Centre for Integrative Biodiversity Research (iDiv) (FZT-118, DFG).
