The Homogeneity & Location Index: A Statistical Framework for the Classification of Ordinal Categorical Data
The objective of this work is to provide tools for the classification of ordinal categorical distributions. To demonstrate the approach, we propose a Homogeneity Index (HI) and a Location Index (LI) to measure the concentration and central value of an ordinal categorical distribution. We also provide a transparent set of criteria that a user can follow to establish whether a given HI value indicates a "high" or "low" concentration of values around the central value of a distribution. Finally, we provide a Concentration Index (CI) for the classification of nominal categorical variables.
We applied our framework to assess the socioeconomic homogeneity of the commonly used [SA3](https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Australian+Statistical+Geography+Standard+(ASGS)) Australian Census geography. In particular, we look at how the population of each SA3 is distributed across the IRSD (Index of Relative Socioeconomic Disadvantage) decile categories.
For more information about this work, the interested reader can refer to the publication: A framework for the identification and classification of homogeneous socioeconomic areas in the analysis of health care variation.
Figure 1. Conceptual framework for the classification of homogeneous areas. Source: International Journal of Health Geographics
- Description
- R Files
- Dependencies
- Usage
- Guidelines for contributing
- Author
- License
- Funding
- Future Work
- Notes
- Contacts
Conceptually, the HI value of a distribution (pdf) is a number between 0 (uniform pdf) and 1 (singleton pdf), defined as the degree to which the population of an area is concentrated among the set of categories. For example, in the case of the IRSD decile, an HI of zero expresses minimal concentration and occurs when the population is equally distributed among all decile categories (i.e. each IRSD decile contains 10% of the population). Conversely, an HI value equal to 1 is attained if the whole population is concentrated in a single decile. In the latter case, there is no variation within the area in that characteristic and the geography is uniquely identified by the central value of the distribution.
The LI of a distribution refers to the category that can be considered representative of the entire population in a unit. For example, in the case of the IRSD decile distribution, it is an integer ranging from 1 (most disadvantaged) to 10 (least disadvantaged).
The formal definitions and statistical properties of the HI and LI are illustrated in the Additional File of the publication: Model
The dataSa3.csv file contains 330 SA3s and 15 columns:
- id: SA3 sequential Identifier
- SA3_code: ABS - 2016 SA3's code identifier
- SA3_name: SA3's name
- State_code: ABS - 2016 SA3's State code identifier
- State_name: SA3's State name
- Columns (6-15): d1,d2,d3,...,di,...,d10. Number of people in each decile.
ABS: Australian Bureau of Statistics.
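As a minimal sketch of how the decile columns (d1 to d10) relate to the distributions that the indices describe, the counts of one area can be normalised into a pdf. The helper counts_to_pdf and the counts below are hypothetical illustrations; they are not part of hi_li.R and the numbers are not taken from dataSa3.csv.

```r
# Hypothetical helper (not part of hi_li.R): turn decile counts into a pdf.
counts_to_pdf <- function(counts) {
  counts <- as.numeric(counts)
  total <- sum(counts)
  if (total <= 0) stop("counts must sum to a positive number")
  counts / total
}

# Made-up counts for a single area (d1..d10), for illustration only.
example_counts <- c(d1 = 50, d2 = 150, d3 = 300, d4 = 250, d5 = 150,
                    d6 = 50, d7 = 30, d8 = 10, d9 = 10, d10 = 0)
example_pdf <- counts_to_pdf(example_counts)

# The shares sum to 1, so the vector can be read as a distribution.
print(round(example_pdf, 3))
```

With the real dataset, the same helper could be applied to columns 6-15 of each row after reading the file, for example with data.table::fread.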
The hi_li.R file contains the implementation of the Homogeneity Index (HI) function [uni.hom] and the Location Index function [uni.loc].
It also includes the following statistical utilities:
- [uni.conCI]: computes the convolution of two vectors
- [uni.corr]: computes the autocorrelation of a vector
- [uni.div]: computes the Divergence Index, a variance for ordinal categorical variables. Please refer to Model.
The create_SA3.R file contains the script to generate a new table with the first 15 columns of dataSA3.csv and 4 additional columns:
- Hom: The value of the Homogeneity Index - HI ∈ [0, 1]
- DI: The value of the Divergence Index - DI ∈ [0, 1]
- LI: The value of the Location Index - LI ∈ {1, 2, ..., 10}
- CL: The Homogeneity Classification - CL ∈ {A, B, C, D}
Table 1. SA3's IRSD HI CLASSIFICATION CRITERIA
CL | HI % | DECISION SUPPORT SYSTEM |
---|---|---|
A | [68.53 - 100] | Acceptably Homogeneous |
B | [57.62 - 68.53) | Marginal Heterogeneity |
C | [46.62 - 57.62) | Judgement Required |
D | [0 - 46.62) | Heterogeneous |
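The criteria in Table 1 can be sketched as a small lookup. The function classify_hi below is a hypothetical helper, not a function of hi_li.R or create_SA3.R; it assumes the HI is expressed as a percentage, so a value on the [0, 1] scale would need to be multiplied by 100 first.

```r
# Hypothetical helper: map an HI value (in percent) to the classes of Table 1.
classify_hi <- function(hi_percent) {
  cut(hi_percent,
      breaks = c(0, 46.62, 57.62, 68.53, 100),
      labels = c("D", "C", "B", "A"),
      right = FALSE,          # intervals are closed on the left: [a, b)
      include.lowest = TRUE)  # with right = FALSE this keeps 100 in class A
}

# Values on the class boundaries fall into the higher class.
print(classify_hi(c(30, 46.62, 60, 68.53, 100)))
```

Using cut with right = FALSE mirrors the half-open intervals of the table, while include.lowest = TRUE closes the top interval so that an HI of 100 is still classified.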
Table 2. HI(s): the HI value of s equally populated deciles clustered on s consecutive bins
s | HI(s) % | pdf vector |
---|---|---|
1 | 100 | [1 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0] |
4 | 68.53 | [1/4 ,1/4 ,1/4 ,1/4 , 0 , 0 , 0 , 0 , 0 , 0] |
5 | 57.62 | [1/5 ,1/5 ,1/5 ,1/5 ,1/5 , 0 , 0 , 0 , 0 , 0] |
6 | 46.62 | [1/6 ,1/6 ,1/6 ,1/6 ,1/6 ,1/6 , 0 , 0 , 0 , 0] |
10 | 0 | [1/10 ,1/10 ,1/10 ,1/10 ,1/10 ,1/10 ,1/10 ,1/10 ,1/10 ,1/10] |
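The "pdf vector" column of Table 2 follows a simple pattern that can be generated programmatically. The helper clustered_pdf below is a hypothetical illustration and is not part of hi_li.R; whether its output can be passed directly to uni.hom depends on that function's signature, which is not documented here.

```r
# Hypothetical helper: pdf with s equally populated categories placed on the
# first s of n consecutive bins, as in the "pdf vector" column of Table 2.
clustered_pdf <- function(s, n = 10) {
  stopifnot(s >= 1, s <= n)
  c(rep(1 / s, s), rep(0, n - s))
}

# Reproduce the rows of Table 2 for s = 1, 4, 5, 6 and 10.
for (s in c(1, 4, 5, 6, 10)) {
  cat("s =", s, ":", round(clustered_pdf(s), 3), "\n")
}
```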
Table 1 shows the partition of the HI's range into four classes. The breaks between classes are determined by HI(s), the HI value of s equally populated deciles clustered on s consecutive bins (Table 2). Here, the parameter s sets the smallest interval of categories that contains all the data. Consider, for example, the value 68.53 (i.e. s = 4, HI(4) = 68.53; Table 2): every distribution with an HI value at or above this threshold (CL = A) is equivalent to a community whose socioeconomic groups are concentrated in at most four consecutive deciles.
The parameter s is also known in the ecological literature as true diversity. Clearly, other criteria can be chosen for the identification of homogeneous distributions, and there is no definitive or "optimal" HI threshold value. However, we believe that the specification of s can help users to picture the homogeneity of a distribution, and that it serves as a guide for interpreting dimensionless concentration indices. For more information about the classification criteria and the notion of true diversity, the interested reader can refer to the publication: A framework for the identification and classification of homogeneous socioeconomic areas in the analysis of health care variation, sections "Concentration Index and true diversity" and "Homogeneity Index and true diversity".
To run the scripts the following software requirements apply:
- R version 3.3.2 or later
- library: data.table to read the dataset
Run the hi_li.R script to load its functions into the R global environment. Then place the create_SA3.R script and the dataSa3.csv file in the working directory and run create_SA3.R from the R console. The output, SA3db.csv, is a 330 x 20 table.
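The run order described above can be sketched as follows. This is only an illustration: it assumes that all three files sit in the current working directory under exactly these names, that data.table is installed, and that create_SA3.R writes SA3db.csv to the same directory.

```r
# Sketch of the run order; file names are taken from this README and are
# assumed to be in the current working directory.
required <- c("hi_li.R", "create_SA3.R", "dataSa3.csv")
missing_files <- required[!file.exists(required)]

if (length(missing_files) == 0) {
  source("hi_li.R")       # defines uni.hom, uni.loc and the utilities
  source("create_SA3.R")  # expected to write SA3db.csv
  out <- data.table::fread("SA3db.csv")
  print(dim(out))         # dimensions of the generated table
} else {
  message("Missing files: ", paste(missing_files, collapse = ", "))
}
```

Checking for the files first avoids a partial run in which the functions are loaded but the dataset cannot be found.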
Feel free to use the hi_li.R library to classify your categorical dataset. Enjoy 😊!
I welcome contributions to the hi_li.R library. Please see the CONTRIBUTING file for detailed guidelines on how to contribute.
Ludovico Pinzari
The homogeneity-location-index package is licensed under the MIT License. See the LICENSE file for more details.
This work was funded through a partnership agreement between the Capital Markets Cooperative Research Centre and the Australian Institute of Health and Welfare, which provided me with a Ph.D. scholarship.
A complete discussion of the mathematical model is included in my thesis dissertation about to be submitted in June 2019. I'll soon share a link to my work. I'll also push new documentation and files to this repo.
If you wish to reproduce the results illustrated in the publication A framework for the identification and classification of homogeneous socioeconomic areas in the analysis of health care variation, please use the following dataset: data
For any enquiries about my work, please visit my web site: contacts or contact me on my LinkedIn profile: ludovico-pinzari