{xmap} workflow

This package is an implementation of the Crossmaps Framework for unified specification, verification, implementation and documentation of operations involved in transforming aggregate statistics between related measurement instruments (e.g. classification codes).
The framework conceptualises the aggregation or redistribution of numeric masses between related taxonomic structures as an operation which applies a graph-based representation of mapping and redistribution logic between source and target keys (the crossmap) to conformable key-value pairs (the shared mass array).
A crossmap specifies:
A shared mass array is a collection of key-value pairs, where the values form a shared numeric mass and the keys are parts of a shared conceptual whole (e.g. GDP by state -> country).
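The idea of applying a crossmap to a shared mass array can be sketched in base R as a join between the data and the links, followed by a weighted sum per target key. This is only an illustration under assumed toy data: the names `links`, `mass` and `apply_links()` are hypothetical and are not part of {xmap}.

```r
# Toy crossmap: each row is a link from a source key to a target key,
# with the share of the source value that the link carries.
links <- data.frame(
  from   = c("a", "b", "c", "c"),
  to     = c("X", "X", "X", "Y"),
  weight = c(1, 1, 0.4, 0.6)
)

# Toy shared mass array: one numeric value per source key.
mass <- data.frame(key = c("a", "b", "c"), value = c(10, 20, 50))

# Hypothetical helper: join, scale by link weight, then sum per target key.
apply_links <- function(mass, links) {
  joined <- merge(mass, links, by.x = "key", by.y = "from")
  joined$part <- joined$value * joined$weight
  aggregate(part ~ to, data = joined, FUN = sum)
}

out <- apply_links(mass, links)
print(out)

# Total mass is preserved because outgoing weights sum to one per source key.
stopifnot(isTRUE(all.equal(sum(out$part), sum(mass$value))))
```

In this toy case keys `a` and `b` are aggregated into `X`, while the value of `c` is split between `X` and `Y`, so one table of links covers both many-to-one and one-to-many relations.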
The crossmaps framework is an alternative approach to data transformation that removes the need for bespoke code to handle data preparation involving many-to-one or one-to-many operations.
The framework gives rise to assertions on the input crossmap and shared mass arrays which ensure the transformations are valid and implemented exactly as specified. Valid and well-documented transformation workflows should have the following properties:
sum(state, na.rm = TRUE))

See the related paper, A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics, for further details on the conditions which guarantee the above properties. This package implements workflow warnings and errors to ensure relevant conditions are met.
Consider data transformations which reference relations between hierarchical administrative regions.
In the following example, we use some basic data manipulation operations from {dplyr} to generate mapping weights for transforming numeric mass (e.g. GDP):
For aggregation, we use unit weights:
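As a minimal sketch, a links table with unit weights could be built in base R as follows. The state codes and the column names `state`, `ctry` and `ones` mirror the crossmap output shown in this document; the object name `links` is hypothetical.

```r
# State codes copied from the output shown in this document.
states <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD",
            "AU-SA", "AU-TAS", "AU-VIC", "AU-WA")

# Hypothetical links table: every state maps to the country with weight 1,
# so each state's value is passed on in full.
links <- data.frame(
  state = states,
  ctry  = "AUS",
  ones  = 1L
)

print(links)

# Each source key has exactly one outgoing link carrying 100% of its value.
stopifnot(all(tapply(links$ones, links$state, sum) == 1))
```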
Links are validated when coercing them into crossmaps, and some additional information about the transformation is computed (i.e. how many unique keys are in the source and target taxonomies):
(agg_xmap <- aus_state_agg_links |>
  as_xmap_tbl(from = state, to = ctry, weight_by = ones))
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [8] state -> [1] ctry
#>   .from$state .to$ctry .weight_by$ones
#>   <chr>       <chr>              <int>
#> 1 AU-ACT      AUS                    1
#> 2 AU-NSW      AUS                    1
#> 3 AU-NT       AUS                    1
#> 4 AU-QLD      AUS                    1
#> 5 AU-SA       AUS                    1
#> 6 AU-TAS      AUS                    1
#> 7 AU-VIC      AUS                    1
#> 8 AU-WA       AUS                    1

The unit weights represent a “transfer” of 100% of the source values indexed by .from keys to the target .to keys.
Let’s generate some dummy state-level data to apply our aggregation to:
set.seed(1395)
(aus_state_data <- demo$aus_state_pairs |>
  mutate(
    gdp = runif(n(), 100, 2000),
    ref = 100
  ))
#> # A tibble: 8 × 4
#>   ctry  state    gdp   ref
#>   <chr> <chr>  <dbl> <dbl>
#> 1 AUS   AU-ACT 1626.   100
#> 2 AUS   AU-NSW 1244.   100
#> 3 AUS   AU-NT   703.   100
#> 4 AUS   AU-QLD  239.   100
#> 5 AUS   AU-SA  1388.   100
#> 6 AUS   AU-TAS 1192.   100
#> 7 AUS   AU-VIC 1535.   100
#> 8 AUS   AU-WA   306.   100

Now to transform / aggregate our data:
(aus_ctry_data <- aus_state_data |>
  apply_xmap(
    .xmap = agg_xmap,
    values_from = c(gdp, ref),
    keys_from = state
  ))
#> # A tibble: 1 × 3
#>   ctry    gdp   ref
#>   <chr> <dbl> <dbl>
#> 1 AUS   8233.   800

What happens if our crossmap is missing instructions for multiple states?
## dropping links
agg_xmap[1:3, ]
#> # A crossmap tibble: 3 × 3
#> # with unique keys: [3] state -> [1] ctry
#>   .from$state .to$ctry .weight_by$ones
#>   <chr>       <chr>              <int>
#> 1 AU-ACT      AUS                    1
#> 2 AU-NSW      AUS                    1
#> 3 AU-NT       AUS                    1

## will lead to an error!
apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref),
  keys_from = state
)
#> Error in `apply_xmap()`:
#> ✖ One or more keys in `.data` do not have corresponding links in `.xmap`
#> ℹ Add missing links to `.xmap` or subset `.data`

This error prevents observations from being dropped accidentally when the transformation instructions are incompletely specified.
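The coverage condition behind this error can be illustrated with a plain set difference in base R. The sketch below uses assumed toy keys and hypothetical object names; it shows the idea of the check and is not the package's internal implementation.

```r
# Keys present in the data (toy values, not the data used above).
data_keys <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD")

# Source keys for which the links give instructions.
link_from <- c("AU-ACT", "AU-NSW", "AU-NT")

# Keys that a plain inner join would silently drop.
not_covered <- setdiff(data_keys, link_from)
print(not_covered)

if (length(not_covered) > 0) {
  message("Keys without links: ", paste(not_covered, collapse = ", "))
}
```

Checking coverage before joining turns a silent loss of rows into an explicit decision: either add the missing links or subset the data.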
To inspect and remedy this issue, we can use diagnose_apply_xmap() to find out which keys in .data are not covered by the .xmap:
diagnose_apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref)
)
#> ✖ Found 8 keys in `.data` without corresponding match in `.xmap$.from`
#> See .$not_covered
#> $not_covered
#> # A tibble: 8 × 2
#>   .key         .value$gdp  $ref
#>   <tibble[,0]>      <dbl> <dbl>
#> 1                   1626.   100
#> 2                   1244.   100
#> 3                    703.   100
#> 4                    239.   100
#> 5                   1388.   100
#> 6                   1192.   100
#> 7                   1535.   100
#> 8                    306.   100

Missing values will also be flagged to encourage explicit handling of missing values before the apply_xmap() mapping transformation:
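What explicit handling might look like can be sketched in base R, independently of the package. The data below is an assumed toy example, and the two strategies shown (dropping incomplete rows, or treating missing values as zero mass) are illustrative choices rather than recommendations from the package.

```r
# Toy values with one missing observation (assumed for illustration).
values <- data.frame(
  state = c("AU-ACT", "AU-NSW", "AU-NT"),
  gdp   = c(1626, NA, 703)
)

# Detect missing values before any transformation.
has_missing <- anyNA(values$gdp)
print(has_missing)

# One explicit choice: drop incomplete rows.
complete_values <- values[!is.na(values$gdp), ]

# Another explicit choice: treat missing as zero mass.
zero_filled <- values
zero_filled$gdp[is.na(zero_filled$gdp)] <- 0

stopifnot(!anyNA(complete_values$gdp), !anyNA(zero_filled$gdp))
```

Which choice is appropriate depends on what a missing value means in the data at hand, which is why it is better made before the transformation than inside it.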
For redistribution, we can choose any weights as long as the weights on the outgoing links from each source key sum to one (or near enough, in the sense of dplyr::near()). This ensures that we only split source values into percentage parts that sum to 100%.
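The condition on outgoing weights can be checked by hand in base R. The sketch below uses hypothetical links and a tolerance equal to the documented default of `dplyr::near()` (`.Machine$double.eps^0.5`); it illustrates the condition and is not the package's validation code.

```r
# Hypothetical links with fractional weights.
links <- data.frame(
  from   = c("AUS", "AUS", "AUS", "NZL"),
  to     = c("s1", "s2", "s3", "n1"),
  weight = c(1 / 3, 1 / 3, 1 / 3, 1)
)

# Total outgoing weight per source key.
totals <- tapply(links$weight, links$from, sum)

# Compare against 1 with a tolerance instead of exact equality.
tol <- .Machine$double.eps^0.5
valid <- abs(totals - 1) < tol
print(valid)

stopifnot(all(valid))
```

A tolerance is needed because fractional weights such as 1/3 cannot be represented exactly in floating point, so their sum may differ from one by a tiny amount.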
A common naive strategy is to distribute equally amongst related target keys:
demo$aus_state_pairs |>
  group_by(ctry) |>
  mutate(equal = 1 / n_distinct(state)) |>
  ungroup() |>
  as_xmap_tbl(from = ctry, to = state, weight_by = equal)
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [1] ctry -> [8] state
#>   .from$ctry .to$state .weight_by$equal
#>   <chr>      <chr>                <dbl>
#> 1 AUS        AU-ACT               0.125
#> 2 AUS        AU-NSW               0.125
#> 3 AUS        AU-NT                0.125
#> 4 AUS        AU-QLD               0.125
#> 5 AUS        AU-SA                0.125
#> 6 AUS        AU-TAS               0.125
#> 7 AUS        AU-VIC               0.125
#> 8 AUS        AU-WA                0.125

If we use invalid weights, such as unit weights, as_xmap_tbl() will error:
demo$aus_state_pairs |>
  mutate(ones = 1) |>
  as_xmap_tbl(from = ctry, to = state, weight_by = ones)
#> Error in `xmap_tbl()`:
#> ! Invalid `.weight_by` found for some links
#> ✖ The total outgoing `.weight_by` for some `.from` nodes are not near enough to
#> 1
#> ℹ Modify `.weight_by` or adjust `tol` and try again.
#> ℹ Use `diagnose_xmap_tbl() for more information.

Except in the case of one-to-one mappings, crossmaps are generally lateral (one-way), and have different weights in each direction.
A more sophisticated strategy for generating weights is to use reference information. For example, we can use population shares to redistribute GDP between states:
(split_xmap_pop <- demo$aus_state_pop_df |>
  group_by(ctry) |>
  mutate(pop_share = pop / sum(pop)) |>
  ungroup() |>
  as_xmap_tbl(
    from = ctry,
    to = state,
    weight_by = pop_share
  ))
#> # A crossmap tibble: 8 × 3
#> # with unique keys: [1] ctry -> [8] state
#>   .from$ctry .to$state .weight_by$pop_share
#>   <chr>      <chr>                    <dbl>
#> 1 AUS        AU-ACT                 0.0176
#> 2 AUS        AU-NSW                 0.314
#> 3 AUS        AU-NT                  0.00965
#> 4 AUS        AU-QLD                 0.205
#> 5 AUS        AU-SA                  0.0701
#> 6 AUS        AU-TAS                 0.0220
#> 7 AUS        AU-VIC                 0.255
#> 8 AUS        AU-WA                  0.107

Let’s redistribute the country-level data we aggregated above back to state level using our calculated population weights:
aus_state_data2 <- aus_ctry_data |>
  mutate(ref = 10000) |>
  apply_xmap(
    split_xmap_pop,
    values_from = c(gdp, ref),
    keys_from = ctry
  )

Note that the values in the transformed ref column do not exactly match the float values in .weight_by$pop_share used as transformation weights. This is due to floating point inaccuracies. Over larger transformations with more keys, this may result in slight mismatches between the total numeric mass before and after transformation.
#> # A tibble: 8 × 5
#>   .from$ctry state    gdp   ref .weight_by$pop_share
#>   <chr>      <chr>  <dbl> <dbl>                <dbl>
#> 1 AUS        AU-ACT  145.  176.              0.0176
#> 2 AUS        AU-NSW 2584. 3139.              0.314
#> 3 AUS        AU-NT   79.4  96.5              0.00965
#> 4 AUS        AU-QLD 1687. 2049.              0.205
#> 5 AUS        AU-SA   577.  701.              0.0701
#> 6 AUS        AU-TAS  181.  220.              0.0220
#> 7 AUS        AU-VIC 2096. 2546.              0.255
#> 8 AUS        AU-WA   883. 1072.              0.107
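Because of such floating point effects, totals before and after a transformation are better compared with a tolerance than with exact equality. The base R sketch below uses assumed toy reference values and hypothetical object names to show the pattern.

```r
# Exact equality can fail for simple decimal arithmetic.
print(0.1 + 0.2 == 0.3)                   # FALSE
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))  # TRUE

# Toy redistribution: split a total using shares derived from
# assumed reference values.
ref_values <- c(3, 7, 11, 13)
shares <- ref_values / sum(ref_values)
total <- 8233
parts <- total * shares

# Compare totals before and after with a tolerance, not with `==`.
stopifnot(isTRUE(all.equal(sum(parts), total)))
```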