NotificationsYou must be signed in to change notification settings
Fork2
Star6

Reshape disorganised messy data

You must be signed in to change notification settings

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 630 Commits
.github		.github
R		R
data		data
man		man
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CRAN-SUBMISSION		CRAN-SUBMISSION
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.Rmd		README.Rmd
README.md		README.md
_pkgdown.yml		_pkgdown.yml
codecov.yml		codecov.yml
cran-comments.md		cran-comments.md
tabshiftr.Rproj		tabshiftr.Rproj

Repository files navigation

tabshiftr

Overview

Data are stored in many different ways in tables or spreadsheets becauseno strict semantic or topographic standards for the organisation oftables are commonly accepted. In the R environment thetidy paradigmis a first step towards interoperability of data, in that it requires acertain arrangement of tables, where variables are recorded in columnsand observations in rows (seehttps://tidyr.tidyverse.org/). Tablescan be tidied (i.e., brought into a tidy arrangement) via packages suchastidyr, however, all functions that deal with reshaping tables todate require data that are already organised into topologicallycoherent, rectangular tables. This is often violated in practice,especially in data that are scraped off of the internet.

tabshiftr fills this gap in the toolchain towards more interoperabledata viaschema descriptions that are built with setters and debuggedwith getters and areorganise() function that ties everythingtogether.

Installation

Install the official version from CRAN:

install.packages("tabshiftr")

or the latest development version from github:

devtools::install_github("luckinet/tabshiftr")

Thevignette

gives an introduction of the nature of tabular data and of thedimensions of disorganization dealt with here,
provides an instruction on how to set up a schema description
shows a wide range of table arrangements that can be reshaped with thetools provided here.

Examples

A disorganized table may look like the following table:

library(tabshiftr)library(knitr)# a rather disorganized table with messy clusters and a distinct variableinput<-tabs2shift$clusters_messykable(input)

X1	X2	X3	X4	X5	X6	X7
commodities	harvested	production	.	.	.	.
unit 1	.	.	.	.	.	.
soybean	1111	1112	year 1	.	.	.
maize	1121	1122	year 1	.	.	.
soybean	1211	1212	year 2	.	.	.
maize	1221	1222	year 2	.	.	.
.	.	.	.	.	.	.
commodities	harvested	production	commodities	harvested	production	.
unit 2	.	.	unit 3	.	.	.
soybean	2111	2112	soybean	3111	3112	year 1
maize	2121	2122	maize	3121	3122	year 1
soybean	2211	2212	soybean	3211	3212	year 2
maize	2221	2222	maize	3221	3222	year 2

If we were to transform this data into tidy data by merely using thefunctions intidyr (or the extendedtidyverse in general), we’dpotentially end up with a massive algorithm, especially for suchcomplicated table arrangements. For other tables that may or may not beas complicated, we’d have to set up yet more algorithms and while apipeline of tidy functions is relatively easy to set up, it would stillbecome very laborious to repeat this for the dozens of potential tablearrangements. Intabshiftr we solve that by describing the schema ofthe input table and providing this schema description to thereorganise() function. This requires us to use a vastly smaller set ofcode and makes it thus a lot more efficient to bring multipleheterogeneous data into an interoperable format.

# put together schema description by ...# ... identifying cluster positionsschema<- setCluster(id="territories",left= c(1,1,4),top= c(1,8,8))# ... specifying the cluster ID as id variable (obligatory for when we deal with clusters)schema<-schema %>%   setIDVar(name="territories",columns= c(1,1,4),rows= c(2,9,9))# ... specifying a distinct variable (explicit position)schema<-schema %>%   setIDVar(name="year",columns=4,rows= c(3:6),distinct=TRUE)# ... specifying a tidy identifying variable (by giving the column values)schema<-schema %>%   setIDVar(name="commodities",columns= c(1,1,4))# ... identifying the (tidy) observed variablesschema<-schema %>%   setObsVar(name="harvested",columns= c(2,2,5)) %>%   setObsVar(name="production",columns= c(3,3,6))# to potentially debug the schema description, first validate the schema ...schema_valid<- validateSchema(schema=schema,input=input)# ... and extract parts of it per cluster (also check out the other getters in# this package)getIDVars(schema=schema_valid,input=input)#> [[1]]#> [[1]]$year#> # A tibble: 4 × 1#>   X4#>   <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[1]]$commodities#> # A tibble: 4 × 1#>   X1#>   <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[2]]#> [[2]]$year#> # A tibble: 4 × 1#>   X4#>   <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[2]]$commodities#> # A tibble: 4 × 1#>   X1#>   <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maize#>#>#> [[3]]#> [[3]]$year#> # A tibble: 4 × 1#>   X4#>   <chr>#> 1 year 1#> 2 year 1#> 3 year 2#> 4 year 2#>#> [[3]]$commodities#> # A tibble: 4 × 1#>   X4#>   <chr>#> 1 soybean#> 2 maize#> 3 soybean#> 4 maizegetObsVars(schema=schema_valid,input=input)#> [[1]]#> [[1]]$harvested#> # A tibble: 4 × 1#>   X2#>   <chr>#> 1 1111#> 2 1121#> 3 1211#> 4 1221#>#> [[1]]$production#> # A tibble: 4 × 1#>   X3#>   <chr>#> 1 1112#> 2 1122#> 3 1212#> 4 1222#>#>#> [[2]]#> [[2]]$harvested#> # A tibble: 4 × 1#>   X2#>   <chr>#> 1 2111#> 2 2121#> 3 2211#> 4 2221#>#> [[2]]$production#> # A tibble: 4 × 1#>   X3#>   <chr>#> 1 2112#> 2 2122#> 3 2212#> 4 2222#>#>#> [[3]]#> [[3]]$harvested#> # A tibble: 4 × 1#>   X5#>   <chr>#> 1 3111#> 2 3121#> 3 3211#> 4 3221#>#> [[3]]$production#> # A tibble: 4 × 1#>   X6#>   <chr>#> 1 3112#> 2 3122#> 3 3212#> 4 3222# alternatively, if the clusters are regular, relative values starting from the# cluster origin could be setschema_alt<- setCluster(id="territories",left= c(1,1,4),top= c(1,8,8)) %>%  setIDVar(name="territories",columns=1,rows= .find(row=2,relative=TRUE)) %>%  setIDVar(name="year",columns=4,rows= c(3:6),distinct=TRUE) %>%  setIDVar(name="commodities",columns= .find(col=1,relative=TRUE)) %>%  setObsVar(name="harvested",columns= .find(col=2,relative=TRUE)) %>%  setObsVar(name="production",columns= .find(col=3,relative=TRUE))

Thereorganise() function carries out the steps of validating,extracting the variables, pivoting the tentative output and putting thefinal table together automatically, so it merely requires the finalisedschema and theinput table.

schema# has a pretty print function#>   3 clusters#>     origin : 1|1, 8|1, 8|4  (row|col)#>     id     : territories#>#>    variable      type       row    col    dist#>   ------------- ---------- ------ ------ ------#>    territories   id         2, 9   1, 4   F#>    year          id         3:6    4      T#>    commodities   id                1, 4   F#>    harvested     observed          2, 5   F#>    production    observed          3, 6   Foutput<- reorganise(input=input,schema=schema)kable(output)

territories	year	commodities	harvested	production
unit 1	year 1	soybean	1111	1112
unit 1	year 1	maize	1121	1122
unit 1	year 2	soybean	1211	1212
unit 1	year 2	maize	1221	1222
unit 2	year 1	soybean	2111	2112
unit 2	year 1	maize	2121	2122
unit 2	year 2	soybean	2211	2212
unit 2	year 2	maize	2221	2222
unit 3	year 1	soybean	3111	3112
unit 3	year 1	maize	3121	3122
unit 3	year 2	soybean	3211	3212
unit 3	year 2	maize	3221	3222

Contributions

tabshiftr is still in development. So far it reliably reorganizes 20different types of tables, but additional dimensions ofdisorganization might show themselves. If you encounter a table thatcan’t be reorganized with the current infrastructure, we’d be morethan happy to collaborate on advancingtabshiftr.
Informative error management is work in process.
Moreover, the resulting schema descriptions can be useful for dataarchiving or database building andtabshiftr should at some pointsupport that those schemas can be exported into data-formats that areused by downstream applications (xml, json, …), following proper (ISO)standards. In case you have experience with those standards and wouldlike to collaborate on it, please get in touch!

Acknowledgement

This work was supported by funding to Carsten Meyer through the Flexpoolmechanism of the German Centre for Integrative Biodiversity Research(iDiv) (FZT-118, DFG).

About

Reshape disorganised messy data

luckinet.github.io/tabshiftr/

Releases3

grouping update Latest

Jun 13, 2023

+ 2 releases

Packages

No packages published

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

tabshiftr

Overview

Installation

Examples

Contributions

Acknowledgement

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases3

Packages

Uh oh!

Contributors5

Uh oh!

Languages

Movatterモバイル変換

luckinet/tabshiftr

Folders and files

Latest commit

History

Repository files navigation

tabshiftr

Overview

Installation

Examples

Contributions

Acknowledgement

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases3

Packages0

Uh oh!

Contributors5

Uh oh!

Languages

Packages