Dictionary-based cleaning for categorical variables


reconhub/matchmaker

Lifecycle: experimental

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

  • preservation of factor orders
  • ability to specify explicit and implicit missing values
  • option to replace by fuzzy matching (regular expressions, anchored by default)
  • optional variable selection by fuzzy matching
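
To illustrate what "anchored" means for a regular expression, here is a minimal base R sketch. It assumes that anchoring means the pattern must match the whole value rather than any part of it. The key and values are hypothetical, and the code uses only `grepl()`, not {matchmaker} internals.

```r
# Hypothetical key and data values, for illustration only
key <- "inc"
values <- c("inc", "incomplete", "zinc")

# Unanchored: the pattern may match anywhere inside the string
unanchored <- grepl(key, values)

# Anchored: "^" and "$" force the pattern to match the whole string
anchored <- grepl(paste0("^", key, "$"), values)

print(data.frame(values, unanchored, anchored))
```

With anchoring, a short key such as "inc" is less likely to rewrite longer values that merely contain it.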

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

  • match_vec() will translate the values in a single vector
  • match_df() will translate values in all specified columns of a data frame

Each of these functions has four mandatory options:

  • x: your data. This will be a vector or data frame depending on the function.
  • dictionary: a data frame with at least two columns specifying keys and values to modify
  • from: a character or number specifying which column contains the keys
  • to: a character or number specifying which column contains the values
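
As a minimal sketch of how these four options fit together, the call below uses match_vec() on a single vector. The small dictionary and the vector `x` are made up for illustration and are not part of the package's example data. The `.missing` key follows the convention shown in the dictionary further down.

```r
library("matchmaker")

# Hypothetical dictionary: keys in "options", replacement values in "values"
dict <- data.frame(
  options = c("y", "n", "u", ".missing"),
  values  = c("Yes", "No", "Unknown", "Missing"),
  stringsAsFactors = FALSE
)

# Hypothetical coded vector, including one missing value
x <- c("y", "n", "u", NA, "y")

# Translate the keys found in x into their dictionary values
cleaned <- match_vec(x, dictionary = dict, from = "options", to = "values")
print(cleaned)
```

Passing column names to `from` and `to` rather than positions keeps the call readable if the dictionary later gains extra columns.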

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

  1. construct your dictionary in a spreadsheet program based on your data
  2. read in your data and dictionary to data frames in R
  3. match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"), stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"), stringsAsFactors = FALSE)

Data

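This is the top of our data set, generated for example purposes:

| id     | date       | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
|--------|------------|-------------|---------|----------|-----------|---------------|---------------|---------------|--------------|----------|
| ef267c | 2019-07-08 | NA          | 0       | C        | 10        | unk           | high          | inc           | NA           | u        |
| e80a37 | 2019-07-07 | y           | 0       | 3        | 10        | inc           | unk           | norm          | y            | oui      |
| b72883 | 2019-07-07 | y           | 1       | 8        | 30        | inc           | norm          | inc           |              | oui      |
| c9ee86 | 2019-07-09 | n           | 1       | 4        | 40        | inc           | inc           | unk           | y            | oui      |
| 40bc7a | 2019-07-12 | n           | 1       | 6        | 0         | norm          | unk           | norm          | NA           | n        |
| 46566e | 2019-07-14 | y           | NA      | B        | 50        | unk           | unk           | inc           | NA           | NA       |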

Dictionary

The dictionary looks like this:

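| options  | values       | grp                 | orders |
|----------|--------------|---------------------|--------|
| y        | Yes          | readmission         | 1      |
| n        | No           | readmission         | 2      |
| u        | Unknown      | readmission         | 3      |
| .missing | Missing      | readmission         | 4      |
| 0        | Yes          | treated             | 1      |
| 1        | No           | treated             | 2      |
| .missing | Missing      | treated             | 3      |
| 1        | Facility 1   | facility            | 1      |
| 2        | Facility 2   | facility            | 2      |
| 3        | Facility 3   | facility            | 3      |
| 4        | Facility 4   | facility            | 4      |
| 5        | Facility 5   | facility            | 5      |
| 6        | Facility 6   | facility            | 6      |
| 7        | Facility 7   | facility            | 7      |
| 8        | Facility 8   | facility            | 8      |
| 9        | Facility 9   | facility            | 9      |
| 10       | Facility 10  | facility            | 10     |
| .default | Unknown      | facility            | 11     |
| 0        | 0-9          | age_group           | 1      |
| 10       | 10-19        | age_group           | 2      |
| 20       | 20-29        | age_group           | 3      |
| 30       | 30-39        | age_group           | 4      |
| 40       | 40-49        | age_group           | 5      |
| 50       | 50+          | age_group           | 6      |
| high     | High         | .regex ^lab_result_ | 1      |
| norm     | Normal       | .regex ^lab_result_ | 2      |
| inc      | Inconclusive | .regex ^lab_result_ | 3      |
| y        | yes          | .global             | Inf    |
| n        | no           | .global             | Inf    |
| u        | unknown      | .global             | Inf    |
| unk      | unknown      | .global             | Inf    |
| oui      | yes          | .global             | Inf    |
| .missing | missing      | .global             | Inf    |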
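
The `.regex ^lab_result_` entries in the grp column appear to select columns by pattern rather than by exact name. This is an inference from the example, where all three lab_result columns are translated. The base R sketch below only illustrates that idea of pattern-based column selection. It is not how {matchmaker} implements it, and the prefix-stripping step is an assumption made for the illustration.

```r
# Column names taken from the example data set above
columns <- c(
  "id", "date", "readmission", "treated", "facility", "age_group",
  "lab_result_01", "lab_result_02", "lab_result_03",
  "has_symptoms", "followup"
)

# Assumption: the ".regex " prefix marks the rest of the entry as a pattern
grp_entry <- ".regex ^lab_result_"
pattern <- sub("^\\.regex ", "", grp_entry)

# Columns whose names match the pattern
selected <- grep(pattern, columns, value = TRUE)
print(selected)
```

One pattern row in the dictionary can therefore stand in for a separate group of rows per lab_result column.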

Matching

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing
