This vignette describes the process of aggregating indicators in COINr.
Aggregation is the operation of combining multiple indicators into one value. Many composite indicators have a hierarchical structure, so in practice this often involves multiple aggregations, for example aggregating groups of indicators into aggregate values, then aggregating those values into higher-level aggregates, and so on, until the final index value.
Aggregating should almost always be done on normalised data, unless the indicators are already on very similar scales. Otherwise the relative influence of indicators will be very uneven.
Of course you don’t have to aggregate indicators at all, and you might be content with a scoreboard, or perhaps with aggregating into several aggregate values rather than a single index. However, consider that aggregation should not substitute the underlying indicator data, but complement it.
Overall, aggregating indicators is a form of information compression - you are trying to combine many indicator values into one, and inevitably information will be lost (this recent paper may be of interest). As long as this is kept in mind, and indicator data is presented and made available alongside aggregate values, then aggregate (index) values can complement indicators and be used as a useful tool for summarising the underlying data, and identifying overall trends and patterns.
Many aggregation methods involve some kind of weighting, i.e. coefficients that define the relative weight of the indicators/aggregates in the aggregation. In order to aggregate, weights need to first be specified, but to effectively adjust weights it is necessary to aggregate.
This chicken and egg conundrum is best solved by aggregating initially with a trial set of weights, perhaps equal weights, then seeing the effects of the weighting, and making any weight adjustments necessary.
The most straightforward and widely-used approach to aggregation is the weighted arithmetic mean. Denoting the indicators as \(x_i \in \{x_1, x_2, ... , x_d\}\), a weighted arithmetic mean is calculated as:
\[ y = \frac{1}{\sum_{i=1}^d w_i}\sum_{i=1}^d x_iw_i \]
where the \(w_i\) are the weights corresponding to each \(x_i\). Here, if the weights are chosen to sum to 1, it will simplify to the weighted sum of the indicators. In any case, the weighted mean is scaled by the sum of the weights, so weights operate relative to each other.
Clearly, if the index has more than two levels, then there will be multiple aggregations. For example, there may be three groups of indicators which give three separate aggregate scores. These aggregate scores would then be fed back into the weighted arithmetic mean above to calculate the overall index.
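The hierarchical idea can be sketched in base R. This is only an illustration under assumptions: the indicator values, group labels and weights are made up, and the helper wamean() is a hypothetical stand-in, not a COINr function.

```r
# illustrative helper (not part of COINr): weighted arithmetic mean
wamean <- function(x, w) sum(w * x) / sum(w)

# made-up normalised indicator scores for one unit
x <- c(ind1 = 20, ind2 = 40, ind3 = 60, ind4 = 80, ind5 = 100)
group <- c("G1", "G1", "G2", "G2", "G2")  # which group each indicator belongs to
w_ind <- c(1, 1, 1, 1, 2)                 # indicator weights

# level 2: aggregate the indicators within each group
level2 <- sapply(split(seq_along(x), group),
                 function(i) wamean(x[i], w_ind[i]))

# level 3: feed the group scores back into the same mean to get the index
w_grp <- c(G1 = 1, G2 = 1)
index <- wamean(level2, w_grp[names(level2)])

level2
index
```

The same function is applied at each level; only the inputs and the weights change.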
The arithmetic mean has “perfect compensability”, which means that a high score in one indicator will perfectly compensate a low score in another. In a simple example with two indicators scaled between 0 and 10 and equal weighting, a unit with scores (0, 10) would be given the same score as a unit with scores (5, 5) – both have a score of 5.
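The compensability example can be checked with a few lines of base R. The helper wamean() below is a hypothetical name used for illustration, written directly from the formula above rather than taken from COINr.

```r
# illustrative helper (not COINr's a_amean()): sum(w * x) / sum(w)
wamean <- function(x, w) sum(w * x) / sum(w)

w <- c(1, 1)                       # equal weights
score_uneven <- wamean(c(0, 10), w)
score_even   <- wamean(c(5, 5), w)
c(score_uneven, score_even)        # both units receive the same score

# weights only matter relative to each other: rescaling them changes nothing
all.equal(wamean(c(2, 4, 9), c(1, 1, 2)),
          wamean(c(2, 4, 9), c(0.25, 0.25, 0.5)))
```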
An alternative is the weighted geometric mean, which uses the product of the indicators rather than the sum.
\[ y = \left( \prod_{i=1}^d x_i^{w_i}\right)^{1 / \sum_{i=1}^d w_i} \]
This is simply the product of each indicator to the power of its weight, all raised to the power of the inverse of the sum of the weights.
The geometric mean is less compensatory than the arithmetic mean – high values in one indicator only partially compensate for low values in others. For this reason, the geometric mean may sometimes be preferred when indicators represent “essentials”. An example might be quality of life: a longer life expectancy perhaps should not compensate for severe restrictions on personal freedoms.
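To make the reduced compensability concrete, here is a small base R sketch. The helper wgmean() is a hypothetical name written from the formula above (computed via logs), and the scores are made up; it is not COINr's a_gmean().

```r
# illustrative helper: weighted geometric mean, computed on the log scale
# (requires strictly positive values)
wgmean <- function(x, w) {
  stopifnot(all(x > 0))
  exp(sum(w * log(x)) / sum(w))
}

w <- c(1, 1)                     # equal weights
g_uneven <- wgmean(c(1, 9), w)   # one low and one high score
g_even   <- wgmean(c(5, 5), w)   # two middling scores
c(g_uneven, g_even)
```

Both units have an arithmetic mean of 5, but the geometric mean of the uneven unit is about 3, so the high score only partly makes up for the low one.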
A third type of mean, in fact the third of the so-called Pythagorean means, is the weighted harmonic mean. This uses the mean of the reciprocals of the indicators:
\[ y = \frac{\sum_{i=1}^dw_i}{\sum_{i=1}^d w_i/x_i} \]
The harmonic mean is the least compensatory of the three means, even less so than the geometric mean. It is often used for taking the mean of rates and ratios.
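The ordering of the three means can be seen side by side in a short base R sketch. The three helpers are hypothetical names written from the formulas above, and the scores are made up for illustration.

```r
# illustrative helpers for the three Pythagorean means (not COINr functions)
wamean <- function(x, w) sum(w * x) / sum(w)
wgmean <- function(x, w) exp(sum(w * log(x)) / sum(w))
whmean <- function(x, w) sum(w) / sum(w / x)

x <- c(1, 9)   # one low and one high score
w <- c(1, 1)   # equal weights

means <- c(arithmetic = wamean(x, w),
           geometric  = wgmean(x, w),
           harmonic   = whmean(x, w))
means
```

For these values the arithmetic mean is 5, the geometric mean about 3 and the harmonic mean about 1.8: the less compensatory the mean, the more the low score pulls the result down.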
The weighted median is also a simple alternative candidate. It is defined by ordering indicator values, then picking the value which has half of the assigned weight above it, and half below it. For ordered indicators \(x_1, x_2, ..., x_d\) and corresponding weights \(w_1, w_2, ..., w_d\) (normalised to sum to one), the weighted median is the indicator value \(x_m\) that satisfies:
\[ \sum_{i=1}^{m-1} w_i \leq \frac{1}{2},\: \: \text{and} \sum_{i=m+1}^{d} w_i \leq \frac{1}{2} \]
The median is known to be robust to outliers, and this may be of interest if the distribution of scores across indicators is skewed.
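The definition above translates fairly directly into base R. The following is a minimal sketch under assumptions: wmedian() is a hypothetical helper, the values are made up, and when two values satisfy the condition it simply returns the lower one. Other implementations (for example ones that interpolate) may give different results.

```r
# illustrative helper: weighted median following the definition above
wmedian <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord] / sum(w)        # normalise weights to sum to one
  below <- cumsum(w) - w      # total weight strictly below each value
  above <- 1 - cumsum(w)      # total weight strictly above each value
  x[which(below <= 0.5 & above <= 0.5)[1]]
}

x <- c(1, 2, 3, 4, 100)       # one extreme value

med_equal  <- wmedian(x, w = c(1, 1, 1, 1, 1))
med_skewed <- wmedian(x, w = c(1, 1, 1, 1, 6))
c(med_equal, med_skewed)
mean(x)                       # for comparison: pulled up by the outlier
```

With equal weights the extreme value has no effect on the median, whereas giving it more than half of the total weight makes it the weighted median.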
Another somewhat different approach to aggregation is to use the Copeland method. This approach is based on pairwise comparisons between units and proceeds as follows. First, an outranking matrix is constructed, which is a square matrix with \(N\) columns and \(N\) rows, where \(N\) is the number of units.
The element in the \(p\)th row and \(q\)th column of the matrix is calculated by summing all the indicator weights where unit \(p\) has a higher value in those indicators than unit \(q\). Similarly, the cell in the \(q\)th row and \(p\)th column (which is the cell opposite on the other side of the diagonal) is calculated as the sum of the weights where unit \(q\) has a higher value than unit \(p\). If the indicator weights sum to one over all indicators, then these two scores will also sum to 1 by definition. The outranking matrix effectively summarises to what extent each unit scores better or worse than all other units, for all unit pairs.
The Copeland score for each unit is calculated by taking the sum of the row values in the outranking matrix. This can be seen as an average measure of the extent to which that unit performs above other units.
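The steps described above can be sketched in base R for a tiny made-up data set. This follows the description in the text and is not necessarily how COINr's a_copeland() is implemented internally; the unit names, indicator values and weights are assumptions for illustration.

```r
# made-up scores: 3 units (rows) by 3 indicators (columns)
X <- matrix(c(1, 2, 9,
              5, 6, 7,
              3, 1, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("i1", "i2", "i3")))
w <- c(0.5, 0.25, 0.25)   # indicator weights, summing to one

# outranking matrix: entry [p, q] is the sum of the weights of the
# indicators in which unit p scores higher than unit q
N <- nrow(X)
outrank <- matrix(0, N, N, dimnames = list(rownames(X), rownames(X)))
for (p in seq_len(N)) {
  for (q in seq_len(N)) {
    if (p != q) outrank[p, q] <- sum(w[X[p, ] > X[q, ]])
  }
}
outrank

# Copeland score: row sums of the outranking matrix
copeland <- rowSums(outrank)
copeland
```

Because there are no ties in this example and the weights sum to one, each pair of opposite cells sums to one.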
Clearly, this can be applied at any level of aggregation and used hierarchically like the other aggregation methods presented here.
In some cases, one unit may score higher than the other in all indicators. This is called a dominance pair, and corresponds to any pair score in the outranking matrix equal to one (equivalently, the opposite pair score equal to zero).
The percentage of dominance pairs is an indication of robustness. Under dominance, there is no way methodological choices (weighting, normalisation, etc.) can affect the relative standing of the pair in the ranking. One will always be ranked higher than the other. The greater the number of dominance (or robust) pairs in a classification, the less sensitive country ranks will be to methodological assumptions. COINr can calculate the percentage of dominance pairs with an inbuilt function.
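As a rough illustration of the idea, and not a reproduction of COINr's inbuilt function, the percentage of dominance pairs can be computed in base R by comparing every pair of units across all indicators. The data below are made up.

```r
# made-up scores: 3 units (rows) by 3 indicators (columns)
X <- matrix(c(1, 2, 9,
              5, 6, 7,
              3, 1, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("i1", "i2", "i3")))

# all unordered pairs of units (one pair per column)
unit_pairs <- combn(nrow(X), 2)

# a pair is a dominance pair if one unit is higher in every indicator
is_dominant <- apply(unit_pairs, 2, function(pq) {
  a <- X[pq[1], ]
  b <- X[pq[2], ]
  all(a > b) || all(b > a)
})

pct_dominance <- 100 * mean(is_dominant)
pct_dominance
```

Here only the pair (B, C) is a dominance pair, since B scores higher than C in all three indicators, so one pair in three is robust to any choice of weights.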
We now turn to how data sets in a coin can be aggregated using the methods described previously. The function of interest is Aggregate(), which is a generic with methods for coins, purses and data frames. To demonstrate COINr’s Aggregate() function on a coin, we begin by loading the package, and building the example coin, up to the normalised data set.
    library(COINr)

    # build example up to normalised data set
    coin <- build_example_coin(up_to = "Normalise")
    #> iData checked and OK.
    #> iMeta checked and OK.
    #> Written data set to .$Data$Raw
    #> Written data set to .$Data$Denominated
    #> Written data set to .$Data$Imputed
    #> Written data set to .$Data$Screened
    #> Written data set to .$Data$Treated
    #> Written data set to .$Data$Normalised

Consider what is needed to aggregate the normalised data into its higher levels. We need:

- The data set to aggregate
- The structure of the index
- Weights to use in the aggregation
- Specifications for the aggregation
All of these elements are already present in the coin, except the last. For the first point, we simply need to tell Aggregate() which data set to use (using the dset argument). The structure of the index was defined when building the coin in new_coin() (the iMeta argument). Weights were also attached to iMeta. Finally, the aggregation specifications can be given in the arguments of Aggregate(). Let’s begin with the simple case though: using the function defaults.
    # aggregate normalised data set
    coin <- Aggregate(coin, dset = "Normalised")
    #> Written data set to .$Data$Aggregated

By default, the aggregation function performs the following steps:

- Uses the weights that are attached to iMeta.
- Aggregates hierarchically (by default with the weighted arithmetic mean), following the index structure specified in iMeta and using the data specified in dset.
- Creates a new data set .$Data$Aggregated, which consists of the data in dset, plus extra columns with scores for each aggregation group, at each aggregation level.

Let’s examine the new data set. The columns of each level are added successively, working from level 1 upwards, so the highest aggregation level (the index, here) will be the last column of the data frame.
    dset_aggregated <- get_dset(coin, dset = "Aggregated")

    nc <- ncol(dset_aggregated)

    # view aggregated scores (last 11 columns here)
    dset_aggregated[(nc - 10):nc] |>
      head(5) |>
      signif(3)
    #>   ConEcFin Environ Instit   P2P Physical Political Social SusEcFin Conn Sust
    #> 1     12.6    31.9   52.4 39.50     34.8      52.5   71.9     55.7 38.4 53.2
    #> 2     26.2    69.5   77.5 54.10     41.1      78.2   72.8     62.9 55.4 68.4
    #> 3     48.2    53.0   75.6 43.30     72.0      80.8   86.2     50.1 64.0 63.1
    #> 4     13.3    81.7   26.5  5.85     22.9      32.4   27.5     64.6 20.2 57.9
    #> 5     24.6    55.7   75.9 27.10     28.4      67.5   53.3     61.7 44.7 56.9
    #>   Index
    #> 1  45.8
    #> 2  61.9
    #> 3  63.5
    #> 4  39.0
    #> 5  50.8

Here we see the level 2 aggregated scores created by aggregating each group of indicators (the first eight columns), followed by the two sub-indexes (level 3) created by aggregating the scores of level 2, and finally the Index (level 4), which is created by aggregating the “Conn” and “Sust” sub-indexes.
The format of this data frame is not hugely convenient for inspecting the results. To see a more user-friendly version, use the get_results() function.
Let’s now explore some of the options of the Aggregate() function. Like other coin-building functions in COINr, Aggregate() comes with a number of inbuilt options, but can also accept any function that is passed to it, as long as it satisfies some requirements. COINr’s inbuilt aggregation functions begin with a_, and are:
- a_amean(): the weighted arithmetic mean
- a_gmean(): the weighted geometric mean
- a_hmean(): the weighted harmonic mean
- a_copeland(): the Copeland method (note: requires by_df = TRUE)

For details of these methods, see Approaches above and the function documentation of each of the functions listed.
By default, the arithmetic mean is called, but we can easily change this to the geometric mean, for example. However, here we run into a problem: the geometric mean will fail if any values to aggregate are less than or equal to zero. So to use the geometric mean we have to re-do the normalisation step to avoid this. Luckily this is straightforward in COINr:
    coin <- Normalise(coin, dset = "Treated",
                      global_specs = list(f_n = "n_minmax",
                                          f_n_para = list(l_u = c(1, 100))))
    #> Written data set to .$Data$Normalised
    #> (overwritten existing data set)

Now, since the indicators are scaled between 1 and 100 (instead of 0 and 100 as previously), they can be aggregated with the geometric mean.
All four of the aggregation functions mentioned above have the same format (try e.g. ?a_gmean), and are built into the COINr package. But what if we want to use another type of aggregation function? The process is exactly the same.
NOTE: the commands in this vignette that use the Compind package have been disabled from running, because changes to a dependent package are causing problems with the R CMD check. The commands should still work if you run them, but the results will not be shown here.
In this section we use some functions from other packages: the matrixStats package and the Compind package. These are not imported by COINr, so the code here will only work if you have these installed. If this vignette was built on your computer, we have to check whether these packages are installed:
    # ms_installed <- requireNamespace("matrixStats", quietly = TRUE)
    # ms_installed
    # ci_installed <- requireNamespace("Compind", quietly = TRUE)
    # ci_installed

If either of these has returned FALSE, you will see some blanks in the following code chunks. See the online version of this vignette to see the results, or install the above packages and rebuild the vignettes.
Now for an example, we can use the weightedMedian() function from the matrixStats package. This has a number of arguments, but the ones we will use are x and w (with the same meanings as in COINr functions), and na.rm, which we need to set to TRUE.
    # RESTORE above eval=ms_installed
    # load matrixStats package
    # library(matrixStats)
    #
    # # aggregate using weightedMedian()
    # coin <- Aggregate(coin, dset = "Normalised",
    #                   f_ag = "weightedMedian",
    #                   f_ag_para = list(na.rm = TRUE))

The weights w do not need to be specified in f_ag_para because they are automatically passed to f_ag unless specified otherwise.
The general requirements for f_ag functions passed to Aggregate() are that:

- The data to be aggregated is passed as a numerical vector x, possibly with missing values.
- The vector of weights (of the same length as x) is passed as function argument w. If the function doesn’t accept a vector of weights, we can set w = "none" in the arguments to Aggregate(), and it will not try to pass w.
- Any other arguments to f_ag, apart from x and w, should be included in the named list f_ag_para.

Sometimes this may mean that we have to create a wrapper function to satisfy these requirements. For example, the Compind package has a number of sophisticated aggregation approaches. The “benefit of the doubt” approach uses data envelopment analysis to aggregate indicators; however, the function Compind::ci_bod() outputs a list. We can make a wrapper function to use this inside COINr:
    # RESTORE ABOVE eval= ci_installed
    # NOTE: this chunk disabled - see comments above.
    # load Compind
    # suppressPackageStartupMessages(library(Compind))
    #
    # # wrapper to get output of interest from ci_bod
    # # also suppress messages about missing values
    # ci_bod2 <- function(x){
    #   suppressMessages(Compind::ci_bod(x)$ci_bod_est)
    # }
    #
    # # aggregate
    # coin <- Aggregate(coin, dset = "Normalised",
    #                   f_ag = "ci_bod2", by_df = TRUE, w = "none")

The benefit of the doubt approach automatically assigns individual weights to each unit, so we need to specify w = "none" to stop Aggregate() from attempting to pass weights to the function. Importantly, we also need to specify by_df = TRUE, which tells Aggregate() to pass a data frame to f_ag rather than a vector.
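Returning to the simpler vector-based case, a custom function that meets the general requirements for f_ag might look like the sketch below. Everything here is an assumption for illustration: pmean_sketch() is a hypothetical name, and the weighted power mean is just one possible choice of method. The call to Aggregate() is left commented out and has not been run here; it only mirrors the calling pattern used for weightedMedian() above.

```r
# hypothetical custom aggregation function: weighted power mean
# x: numeric vector of values (may contain NAs)
# w: numeric vector of weights, same length as x
# p: exponent (p = 1 gives the weighted arithmetic mean)
pmean_sketch <- function(x, w, p = 2) {
  keep <- !is.na(x)
  if (!any(keep)) return(NA_real_)   # nothing to aggregate
  x <- x[keep]
  w <- w[keep]                       # drop weights of missing values
  (sum(w * x^p) / sum(w))^(1 / p)
}

# standalone check on a vector with a missing value
pm <- pmean_sketch(c(3, 4, NA), w = c(1, 1, 1))
pm

# not run: passing it to Aggregate(), with p supplied via f_ag_para
# coin <- Aggregate(coin, dset = "Normalised",
#                   f_ag = "pmean_sketch",
#                   f_ag_para = list(p = 2))
```

The extra argument p goes in f_ag_para, while x and w are left for Aggregate() to supply.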
Many aggregation functions will return an aggregated value as long as at least one of the values passed to it is non-NA. For example, R’s mean() function:
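The behaviour of mean() with missing values can be illustrated with a minimal base R example (the values are made up):

```r
# a vector with one missing value
x <- c(1, 2, NA)

m_default <- mean(x)                # NA: the missing value propagates
m_na_rm   <- mean(x, na.rm = TRUE)  # 1.5: mean of the non-missing values

m_default
m_na_rm
```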
Depending on how we set na.rm, we either get an answer or NA, and this is the same for many other aggregation functions (e.g. the ones built into COINr). Sometimes we might want a bit more control. For example, if we have five indicators in a group, it might only be reasonable to give an aggregated score if, say, at least three out of five indicators have non-NA values.
The Aggregate() function has the option to specify a data availability limit when aggregating. We simply set dat_thresh to a value between 0 and 1, and for each aggregation group, any unit that has a data availability lower than dat_thresh will get an NA value instead of an aggregated score. This is most easily illustrated on a data frame (see next section for more details on aggregating in data frames):
    df1 <- data.frame(
      i1 = c(1, 2, 3),
      i2 = c(3, NA, NA),
      i3 = c(1, NA, 1)
    )
    df1
    #>   i1 i2 i3
    #> 1  1  3  1
    #> 2  2 NA NA
    #> 3  3 NA  1

We will require that at least 2/3 of the indicators should be non-NA to give an aggregated value.
    # aggregate with arithmetic mean, equal weight and data avail limit of 2/3
    Aggregate(df1, f_ag = "a_amean",
              f_ag_para = list(w = c(1, 1, 1)),
              dat_thresh = 2/3)
    #> [1] 1.666667       NA 2.000000

Here we see that the second row is aggregated to give NA because it only has 1/3 data availability.
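The logic of the data availability limit can be mimicked in base R. This is only a sketch of the idea with equal weights, not COINr's internal implementation; the variable names are chosen for illustration.

```r
df1 <- data.frame(
  i1 = c(1, 2, 3),
  i2 = c(3, NA, NA),
  i3 = c(1, NA, 1)
)
dat_thresh <- 2/3

# fraction of non-missing indicators for each row (unit)
avail <- rowSums(!is.na(df1)) / ncol(df1)

# equally weighted arithmetic mean of the available values
y <- rowMeans(df1, na.rm = TRUE)

# rows below the availability threshold get NA instead of a score
y[avail < dat_thresh] <- NA
y
```

The third row has exactly 2/3 availability, which is not lower than the threshold, so it keeps its score.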
We can also use a different aggregation function for each aggregation level by specifying f_ag as a vector of function names rather than a single function.
    coin <- Aggregate(coin, dset = "Normalised",
                      f_ag = c("a_amean", "a_gmean", "a_amean"))
    #> Written data set to .$Data$Aggregated
    #> (overwritten existing data set)

In this example, there are four levels in the index, which means there are three aggregation operations to be performed: from Level 1 to Level 2, from Level 2 to Level 3, and from Level 3 to Level 4. This means that the f_ag vector must have n-1 entries, where n is the number of levels in the index. The functions are run in the order of aggregation.
In the same way, if parameters need to be passed to the functions specified by f_ag, f_ag_para can be specified as a list of length n-1, where each element is a list of parameters.
The Aggregate() function also works in the same way on data frames. This is probably more useful when aggregation functions take vectors as inputs, rather than data frames, since it would otherwise be easier to go directly to the underlying function. In any case, here is an example, using a built-in COINr function to compute the weighted harmonic mean of a data frame.
    # get some indicator data - take a few columns from built in data set
    X <- ASEM_iData[12:15]

    # normalise to avoid zeros - min max between 1 and 100
    X <- Normalise(X,
                   global_specs = list(f_n = "n_minmax",
                                       f_n_para = list(l_u = c(1, 100))))

    # aggregate using harmonic mean, with some weights
    y <- Aggregate(X, f_ag = "a_hmean", f_ag_para = list(w = c(1, 1, 2, 1)))

    cbind(X, y) |>
      head(5) |>
      signif(3)
    #>    LPI Flights Ship Bord     y
    #> 1 94.1   14.20  1.0 29.4  2.36
    #> 2 94.6   15.60 97.2 40.0 41.50
    #> 3 35.0    4.89 38.0 15.6 14.30
    #> 4 51.2    4.90 59.2 34.3 17.40
    #> 5 43.7    4.66 55.7  1.0  3.93

The purse method for Aggregate() is straightforward and simply applies the same aggregation specifications to each of the coins within. It has exactly the same parameters as the coin method.
    # build example purse up to normalised data set
    purse <- build_example_purse(up_to = "Normalise", quietly = TRUE)

    # aggregate using defaults
    purse <- Aggregate(purse, dset = "Normalised")
    #> Written data set to .$Data$Aggregated
    #> Written data set to .$Data$Aggregated
    #> Written data set to .$Data$Aggregated
    #> Written data set to .$Data$Aggregated
    #> Written data set to .$Data$Aggregated

After aggregating indicators, it is likely that you will want to begin viewing and exploring the results. See the vignette on Exploring results for more details.