Dictionary-based cleaning for categorical variables
reconhub/matchmaker
The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:
- preservation of factor orders
- ability to specify explicit and implicit missing values
- option to replace by fuzzy matching (regular expressions, anchored by default)
- optional variable selection by fuzzy matching
You can install {matchmaker} from CRAN:
```r
install.packages("matchmaker")
```

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

- `match_vec()` will translate the values in a single vector
- `match_df()` will translate values in all specified columns of a data frame
Each of these functions has four mandatory options:
- `x`: your data. This will be a vector or data frame depending on the function.
- `dictionary`: a data frame with at least two columns specifying keys and values to modify
- `from`: a character or number specifying which column contains the keys
- `to`: a character or number specifying which column contains the values
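As a minimal sketch of how the four options fit together, the following translates a single character vector with `match_vec()`. The vector `x` and the two-column dictionary are made up for illustration; only the function and its `x`, `dictionary`, `from`, and `to` options come from the description above.

```r
library(matchmaker)

# Hypothetical dictionary: keys in "options", replacement values in "values"
dict <- data.frame(
  options = c("y", "n", "oui"),
  values = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical data: every entry has a matching key in the dictionary
x <- c("y", "n", "oui", "y")

# Translate each element of x from its key to the corresponding value
cleaned <- match_vec(x, dictionary = dict, from = "options", to = "values")
print(cleaned)
```

Here `from` and `to` are given as column names; according to the option descriptions, column positions (for example `from = 1, to = 2`) should work as well.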
Mostly, users will be working with `match_df()` to transform values across specific columns. A typical workflow would be to:
- construct your dictionary in a spreadsheet program based on your data
- read in your data and dictionary to data frames in R
- match!
```r
library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE
)
```

This is the top of our data set, generated for example purposes:
| id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
|---|---|---|---|---|---|---|---|---|---|---|
| ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
| e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
| b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | | oui |
| c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
| 40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
| 46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |
The dictionary looks like this:
| options | values | grp | orders |
|---|---|---|---|
| y | Yes | readmission | 1 |
| n | No | readmission | 2 |
| u | Unknown | readmission | 3 |
| .missing | Missing | readmission | 4 |
| 0 | Yes | treated | 1 |
| 1 | No | treated | 2 |
| .missing | Missing | treated | 3 |
| 1 | Facility 1 | facility | 1 |
| 2 | Facility 2 | facility | 2 |
| 3 | Facility 3 | facility | 3 |
| 4 | Facility 4 | facility | 4 |
| 5 | Facility 5 | facility | 5 |
| 6 | Facility 6 | facility | 6 |
| 7 | Facility 7 | facility | 7 |
| 8 | Facility 8 | facility | 8 |
| 9 | Facility 9 | facility | 9 |
| 10 | Facility 10 | facility | 10 |
| .default | Unknown | facility | 11 |
| 0 | 0-9 | age_group | 1 |
| 10 | 10-19 | age_group | 2 |
| 20 | 20-29 | age_group | 3 |
| 30 | 30-39 | age_group | 4 |
| 40 | 40-49 | age_group | 5 |
| 50 | 50+ | age_group | 6 |
| high | High | .regex ^lab_result_ | 1 |
| norm | Normal | .regex ^lab_result_ | 2 |
| inc | Inconclusive | .regex ^lab_result_ | 3 |
| y | yes | .global | Inf |
| n | no | .global | Inf |
| u | unknown | .global | Inf |
| unk | unknown | .global | Inf |
| oui | yes | .global | Inf |
| .missing | missing | .global | Inf |
```r
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated   facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes    Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility 3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility 8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility 4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility 6       0-9
#> 6 46566e 2019-07-14         Yes Missing    Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
```
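For readers who want to try the same call without the bundled example files, here is a small self-contained sketch. The data frame `small_dat` and dictionary `small_dict` are hypothetical and built inline; they reuse the `options`, `values`, and `grp` column layout and the `.missing` and `.global` keys from the dictionary shown earlier.

```r
library(matchmaker)

# Hypothetical data: "readmission" has its own dictionary group,
# while "followup" is only covered by the .global entries
small_dat <- data.frame(
  readmission = c("y", "n", NA),
  followup = c("oui", "u", "n"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary in the same layout as the example dictionary:
# keys in "options", replacements in "values", target column in "grp"
small_dict <- data.frame(
  options = c("y", "n", ".missing", "oui", "u", "n"),
  values = c("Yes", "No", "Missing", "yes", "unknown", "no"),
  grp = c(
    "readmission", "readmission", "readmission",
    ".global", ".global", ".global"
  ),
  stringsAsFactors = FALSE
)

# Column-specific entries are matched through "grp"; .global entries
# cover the remaining character columns
small_cleaned <- match_df(small_dat,
  dictionary = small_dict,
  from = "options",
  to = "values",
  by = "grp"
)
print(small_cleaned)
```

As in the larger example, the capitalised values in `readmission` come from its column-specific entries, and the lower-case values in `followup` come from the `.global` entries.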