xmap

library(xmap)
library(dplyr)

The Crossmaps Framework and {xmap} workflow

This package is an implementation of the Crossmaps Framework for unified specification, verification, implementation and documentation of operations involved in transforming aggregate statistics between related measurement instruments (e.g. classification codes).

The framework conceptualises the aggregation or redistribution of numeric masses between related taxonomic structures as an operation which applies a graph-based representation of mapping and redistribution logic between source and target keys (the crossmap) to conformable key-value pairs (the shared mass array).
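The operation described above can be sketched in base R without the package. This is only an illustration of the idea, not how {xmap} is implemented; the data frames `links` and `masses` and their column names are hypothetical.

```r
# Hypothetical crossmap: each row is a weighted link from a source key to a target key.
links <- data.frame(
  from   = c("A", "B", "B"),
  to     = c("X", "X", "Y"),
  weight = c(1, 0.25, 0.75)
)

# Hypothetical shared mass array: one numeric value per source key.
masses <- data.frame(key = c("A", "B"), value = c(10, 40))

# Attach each source value to its outgoing links, scale it by the link weight,
# then sum the pieces that arrive at each target key.
joined <- merge(links, masses, by.x = "from", by.y = "key")
joined$piece <- joined$weight * joined$value
result <- aggregate(piece ~ to, data = joined, FUN = sum)
names(result) <- c("key", "value")
result
# X receives 10 + 10 = 20 and Y receives 30, so the total of 50 is unchanged
```

Because the weights leaving each source key sum to one, the total mass before and after the transformation is the same.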

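A crossmap specifies the mapping and redistribution logic between source and target keys: which source keys link to which target keys, and the weight attached to each link.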

A shared mass array is a collection of key-value pairs, where the values form a shared numeric mass and the keys are parts of a shared conceptual whole (e.g. GDP by state -> country).

The crossmaps framework is an alternative approach to data transformation that removes the need for bespoke code to handle data preparation involving many-to-one or one-to-many operations.

The framework gives rise to assertions on input crossmaps and shared mass arrays which ensure the transformations are valid, and implemented exactly as specified. Valid and well-documented transformation workflows should have the following properties:

See the related paper, A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics, for further details on the conditions which guarantee the above properties. This package implements workflow warnings and errors to ensure relevant conditions are met.

Example: Country-State Mappings

Consider data transformations which reference relations between hierarchical administrative regions.

In the following example, we use some basic data manipulation operations from {dplyr} to generate mapping weights for transforming numeric mass (e.g. GDP):

Aggregation, Coverage, and Missing Value Checks

For aggregation, we use unit weights:

aus_state_agg_links <- demo$aus_state_pairs |>
  mutate(ones = 1L)

Links are validated when coercing them into crossmaps, and some additional information about the transformation is computed (i.e. how many unique keys are in the source and target taxonomies):

(agg_xmap <- aus_state_agg_links |>
  as_xmap_tbl(from = state, to = ctry, weight_by = ones))
#> # A crossmap tibble: 8 × 3
#> # with unique keys:  [8] state -> [1] ctry
#>   .from$state .to$ctry .weight_by$ones
#>   <chr>       <chr>              <int>
#> 1 AU-ACT      AUS                    1
#> 2 AU-NSW      AUS                    1
#> 3 AU-NT       AUS                    1
#> 4 AU-QLD      AUS                    1
#> 5 AU-SA       AUS                    1
#> 6 AU-TAS      AUS                    1
#> 7 AU-VIC      AUS                    1
#> 8 AU-WA       AUS                    1

The unit weights represent a “transfer” of 100% of the source values indexed by .from keys to the target .to keys.

Let’s generate some dummy state-level data to apply our aggregation to:

set.seed(1395)
(aus_state_data <- demo$aus_state_pairs |>
  mutate(
    gdp = runif(n(), 100, 2000),
    ref = 100
  ))
#> # A tibble: 8 × 4
#>   ctry  state    gdp   ref
#>   <chr> <chr>  <dbl> <dbl>
#> 1 AUS   AU-ACT 1626.   100
#> 2 AUS   AU-NSW 1244.   100
#> 3 AUS   AU-NT   703.   100
#> 4 AUS   AU-QLD  239.   100
#> 5 AUS   AU-SA  1388.   100
#> 6 AUS   AU-TAS 1192.   100
#> 7 AUS   AU-VIC 1535.   100
#> 8 AUS   AU-WA   306.   100

Now to transform / aggregate our data:

(aus_ctry_data <- aus_state_data |>
  apply_xmap(
    .xmap = agg_xmap,
    values_from = c(gdp, ref),
    keys_from = state
  ))
#> # A tibble: 1 × 3
#>   ctry    gdp   ref
#>   <chr> <dbl> <dbl>
#> 1 AUS   8233.   800
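With unit weights, the transformation reduces to a grouped sum. The following base R sketch reproduces the `ref` total by hand, assuming only the state codes and the constant `ref = 100` shown in the example; the object names are hypothetical and the code does not call {xmap}.

```r
states <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD",
            "AU-SA", "AU-TAS", "AU-VIC", "AU-WA")

# State-level values and unit-weight links to the single country key.
state_data <- data.frame(state = states, ref = 100)
links <- data.frame(from = states, to = "AUS", weight = 1L)

# Look up the target key and weight for each row of the data.
idx <- match(state_data$state, links$from)
target <- links$to[idx]
weight <- links$weight[idx]

# Sum the weighted values arriving at each target key.
ctry_ref <- rowsum(state_data$ref * weight, group = target)
ctry_ref["AUS", 1]
# 8 states x 100 = 800, matching the ref column in the output above
```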

What happens if our crossmap were missing instructions for multiple states?

## dropping links
agg_xmap[1:3, ]
#> # A crossmap tibble: 3 × 3
#> # with unique keys:  [3] state -> [1] ctry
#>   .from$state .to$ctry .weight_by$ones
#>   <chr>       <chr>              <int>
#> 1 AU-ACT      AUS                    1
#> 2 AU-NSW      AUS                    1
#> 3 AU-NT       AUS                    1

## will lead to an error!
apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref),
  keys_from = state
)
#> Error in `apply_xmap()`:
#> ✖ One or more keys in `.data` do not have corresponding links in `.xmap`
#> ℹ Add missing links to `.xmap` or subset `.data`

This error prevents observations from being dropped accidentally because the transformation instructions were incompletely specified.

To inspect and remedy this issue, we can use diagnose_apply_xmap() to find out which keys in .data are not covered by the .xmap:

diagnose_apply_xmap(
  .data = aus_state_data,
  .xmap = agg_xmap[1:3, ],
  values_from = c(gdp, ref)
)
#> ✖ Found 8 keys in `.data` without corresponding match in `.xmap$.from`
#> See .$not_covered
#> $not_covered
#> # A tibble: 8 × 2
#>   .key         .value$gdp  $ref
#>   <tibble[,0]>      <dbl> <dbl>
#> 1                   1626.   100
#> 2                   1244.   100
#> 3                    703.   100
#> 4                    239.   100
#> 5                   1388.   100
#> 6                   1192.   100
#> 7                   1535.   100
#> 8                    306.   100
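The coverage condition itself is a set difference between the keys in the data and the source keys of the crossmap. A minimal base R sketch, using the state codes from the example and hypothetical object names, is given here; the count reported by the package call above depends on how it resolves the key columns, so it need not match this hand computation.

```r
# Keys present in the data (the eight state codes from the example).
data_keys <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD",
               "AU-SA", "AU-TAS", "AU-VIC", "AU-WA")

# Source keys of a crossmap that keeps only the first three links.
xmap_from <- data_keys[1:3]

# Keys in the data with no outgoing link in the crossmap.
not_covered <- setdiff(data_keys, xmap_from)
not_covered

# The transformation is only safe to apply when nothing is left uncovered.
fully_covered <- length(not_covered) == 0
```

Either adding links for the keys in `not_covered` or subsetting the data to the covered keys makes `fully_covered` true.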

Missing values will also be flagged to encourage explicit handling of missing values before the apply_xmap() mapping transformation:

# add some `NA`
aus_state_data_na <- aus_state_data
aus_state_data_na[c(1, 3, 5), "gdp"] <- NA

apply_xmap(
  .data = aus_state_data_na,
  .xmap = agg_xmap,
  values_from = gdp,
  keys_from = state
)
#> Error in `apply_xmap()`:
#> ✖ Missing values not allowed in `.data` columns: gdp
#> ℹ Remove or replace missing values.
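Before calling the transformation, missing values can be located and handled explicitly with base R. The sketch below uses GDP figures rounded from the printed output above and shows two possible choices (dropping or replacing); which one is appropriate depends on the analysis, and neither is prescribed by the package.

```r
# GDP values rounded from the printed example, with NAs in rows 1, 3 and 5.
gdp <- c(1626, 1244, 703, 239, 1388, 1192, 1535, 306)
gdp[c(1, 3, 5)] <- NA

# Detect and locate the missing values.
has_missing <- anyNA(gdp)
missing_rows <- which(is.na(gdp))

# Option 1: drop the rows with missing values.
gdp_dropped <- gdp[!is.na(gdp)]

# Option 2: replace missing values with an explicit placeholder (here zero).
gdp_replaced <- ifelse(is.na(gdp), 0, gdp)
```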

Redistribution, valid weights and preserving totals

For redistributing, we can choose any weights as long as the sum of weights on outgoing links from each source key totals one (or is near enough to one in the sense of dplyr::near()). This ensures that we only split source values into percentage parts that sum to 100%.

A common naive strategy is to distribute equally amongst related target keys:

demo$aus_state_pairs |>
  group_by(ctry) |>
  mutate(equal = 1 / n_distinct(state)) |>
  ungroup() |>
  as_xmap_tbl(from = ctry, to = state, weight_by = equal)
#> # A crossmap tibble: 8 × 3
#> # with unique keys:  [1] ctry -> [8] state
#>   .from$ctry .to$state .weight_by$equal
#>   <chr>      <chr>                <dbl>
#> 1 AUS        AU-ACT               0.125
#> 2 AUS        AU-NSW               0.125
#> 3 AUS        AU-NT                0.125
#> 4 AUS        AU-QLD               0.125
#> 5 AUS        AU-SA                0.125
#> 6 AUS        AU-TAS               0.125
#> 7 AUS        AU-VIC               0.125
#> 8 AUS        AU-WA                0.125

If we use invalid weights, such as unit weights, as_xmap_tbl() will error:

demo$aus_state_pairs |>
  mutate(ones = 1) |>
  as_xmap_tbl(from = ctry, to = state, weight_by = ones)
#> Error in `xmap_tbl()`:
#> ! Invalid `.weight_by` found for some links
#> ✖ The total outgoing `.weight_by` for some `.from` nodes are not near enough to
#>   1
#> ℹ Modify `.weight_by` or adjust `tol` and try again.
#> ℹ Use `diagnose_xmap_tbl() for more information.
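The weight condition can also be checked by hand: sum the weights on the links leaving each source key and compare the result with one, within a tolerance. The base R sketch below does this for equal and unit weights. The tolerance `sqrt(.Machine$double.eps)` is the default of dplyr::near(); whether {xmap} uses exactly this value internally is an assumption, and the object names are hypothetical.

```r
states <- c("AU-ACT", "AU-NSW", "AU-NT", "AU-QLD",
            "AU-SA", "AU-TAS", "AU-VIC", "AU-WA")

# Links from the single country key to each state, with two candidate weights.
links <- data.frame(
  from  = "AUS",
  to    = states,
  equal = 1 / length(states),
  ones  = 1
)

# Total outgoing weight per source key.
out_equal <- tapply(links$equal, links$from, sum)
out_ones  <- tapply(links$ones,  links$from, sum)

# Compare with one, within a tolerance (assumed, see lead-in).
tol <- sqrt(.Machine$double.eps)
valid_equal <- abs(out_equal - 1) < tol
valid_ones  <- abs(out_ones  - 1) < tol

valid_equal  # equal shares sum to one for AUS
valid_ones   # unit weights sum to eight for AUS, so they are rejected
```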

Except in the case of one-to-one mappings, crossmaps are generally lateral (one-way), and have different weights in each direction.

A more sophisticated strategy for generating weights is to use reference information. For example, we can use population shares to redistribute GDP between states:

(split_xmap_pop <- demo$aus_state_pop_df |>
  group_by(ctry) |>
  mutate(pop_share = pop / sum(pop)) |>
  ungroup() |>
  as_xmap_tbl(
    from = ctry, to = state, weight_by = pop_share
  ))
#> # A crossmap tibble: 8 × 3
#> # with unique keys:  [1] ctry -> [8] state
#>   .from$ctry .to$state .weight_by$pop_share
#>   <chr>      <chr>                    <dbl>
#> 1 AUS        AU-ACT                 0.0176
#> 2 AUS        AU-NSW                 0.314
#> 3 AUS        AU-NT                  0.00965
#> 4 AUS        AU-QLD                 0.205
#> 5 AUS        AU-SA                  0.0701
#> 6 AUS        AU-TAS                 0.0220
#> 7 AUS        AU-VIC                 0.255
#> 8 AUS        AU-WA                  0.107

Let’s redistribute the country-level data we aggregated above back to state level using our calculated population weights:

aus_state_data2 <- aus_ctry_data |>
  mutate(ref = 10000) |>
  apply_xmap(
    split_xmap_pop,
    values_from = c(gdp, ref),
    keys_from = ctry
  )

Note that the values in the transformed ref column do not exactly match the float values in .weight_by$pop_share used as transformation weights. This is due to floating point inaccuracies. Over larger transformations with more keys, this may result in slight mismatches between the total numeric mass before and after transformation.

#> # A tibble: 8 × 5
#>   .from$ctry state     gdp    ref .weight_by$pop_share
#>   <chr>      <chr>   <dbl>  <dbl>                <dbl>
#> 1 AUS        AU-ACT  145.   176.               0.0176
#> 2 AUS        AU-NSW 2584.  3139.               0.314
#> 3 AUS        AU-NT    79.4   96.5              0.00965
#> 4 AUS        AU-QLD 1687.  2049.               0.205
#> 5 AUS        AU-SA   577.   701.               0.0701
#> 6 AUS        AU-TAS  181.   220.               0.0220
#> 7 AUS        AU-VIC 2096.  2546.               0.255
#> 8 AUS        AU-WA   883.  1072.               0.107
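Because of floating point arithmetic, totals before and after a redistribution are better compared with a tolerance than with exact equality. The base R sketch below illustrates this with hypothetical population counts (not the values in demo$aus_state_pop_df) and a reference mass of 10000.

```r
# Hypothetical population counts for eight target keys (illustration only).
pop <- c(431, 8095, 246, 5175, 1771, 542, 6559, 2667)

# Shares used as redistribution weights; mathematically they sum to one.
share <- pop / sum(pop)

# Split a single source value across the targets.
total_before <- 10000
pieces <- total_before * share
total_after <- sum(pieces)

# Exact equality may or may not hold, depending on rounding.
exactly_equal <- total_after == total_before

# A tolerance-based comparison is the more robust check.
nearly_equal <- isTRUE(all.equal(total_after, total_before))
nearly_equal
```

Checking `nearly_equal` rather than `exactly_equal` avoids false alarms when a transformation involves many keys.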
