reconhub/matchmaker


Dictionary-based cleaning for categorical variables

License

GPL-3.0 license


matchmaker R package

The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:

preservation of factor orders
ability to specify explicit and implicit missing values
option to replace by fuzzy matching (regular expressions, anchored by default)
optional variable selection by fuzzy matching

Installation

You can install {matchmaker} from CRAN:

install.packages("matchmaker")

Example

The matchmaker package has two user-facing functions that perform dictionary-based cleaning:

match_vec() will translate the values in a single vector
match_df() will translate values in all specified columns of a data frame

Each of these functions has four mandatory options:

x: your data. This will be a vector or data frame, depending on the function.
dictionary: a data frame with at least two columns specifying keys and values to modify
from: a character or number specifying which column contains the keys
to: a character or number specifying which column contains the values
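As a minimal sketch of how the four options fit together, the call below runs match_vec() on a small made-up vector. The objects small_dict and raw_answers, and the values they hold, are hypothetical and exist only for illustration; the column names simply mirror the ones used in the example dictionary in this README.

```r
library("matchmaker")

# Hypothetical dictionary: "options" holds the keys, "values" the replacements
small_dict <- data.frame(
  options = c("y", "n", "u"),
  values  = c("Yes", "No", "Unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical coded answers to translate
raw_answers <- c("y", "n", "u", "y")

# x, dictionary, from, and to are the four options described above
clean_answers <- match_vec(
  raw_answers,
  dictionary = small_dict,
  from = "options",
  to = "values"
)

print(clean_answers)
```

The from and to columns are named here, but according to the option descriptions above they could equally be given as column numbers.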

Mostly, users will be working with match_df() to transform values across specific columns. A typical workflow would be to:

construct your dictionary in a spreadsheet program based on your data
read in your data and dictionary to data frames in R
match!

library("matchmaker")

# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"), stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"), stringsAsFactors = FALSE)

Data

This is the top of our data set, generated for example purposes:

id	date	readmission	treated	facility	age_group	lab_result_01	lab_result_02	lab_result_03	has_symptoms	followup
ef267c	2019-07-08	NA	0	C	10	unk	high	inc	NA	u
e80a37	2019-07-07	y	0	3	10	inc	unk	norm	y	oui
b72883	2019-07-07	y	1	8	30	inc	norm	inc		oui
c9ee86	2019-07-09	n	1	4	40	inc	inc	unk	y	oui
40bc7a	2019-07-12	n	1	6	0	norm	unk	norm	NA	n
46566e	2019-07-14	y	NA	B	50	unk	unk	inc	NA	NA

Dictionary

The dictionary looks like this:

options	values	grp	orders
y	Yes	readmission	1
n	No	readmission	2
u	Unknown	readmission	3
.missing	Missing	readmission	4
0	Yes	treated	1
1	No	treated	2
.missing	Missing	treated	3
1	Facility 1	facility	1
2	Facility 2	facility	2
3	Facility 3	facility	3
4	Facility 4	facility	4
5	Facility 5	facility	5
6	Facility 6	facility	6
7	Facility 7	facility	7
8	Facility 8	facility	8
9	Facility 9	facility	9
10	Facility 10	facility	10
.default	Unknown	facility	11
0	0-9	age_group	1
10	10-19	age_group	2
20	20-29	age_group	3
30	30-39	age_group	4
40	40-49	age_group	5
50	50+	age_group	6
high	High	.regex ^lab_result_	1
norm	Normal	.regex ^lab_result_	2
inc	Inconclusive	.regex ^lab_result_	3
y	yes	.global	Inf
n	no	.global	Inf
u	unknown	.global	Inf
unk	unknown	.global	Inf
oui	yes	.global	Inf
.missing	missing	.global	Inf
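The grp column ties each key/value pair to a column of the data, and rows marked .global appear to apply to every column that has no more specific entry for that key (this reading is an assumption based on the cleaned output shown in the Matching section, where "y" becomes "Yes" in readmission but "yes" in followup). A minimal sketch with a hypothetical two-column data set and a cut-down dictionary, both made up for illustration:

```r
library("matchmaker")

# Hypothetical data: two columns that share the same raw codes
small_dat <- data.frame(
  readmission = c("y", "n", "y"),
  followup    = c("y", "oui", "n"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: column-specific rows first, then .global rows
small_grp_dict <- data.frame(
  options = c("y", "n", "y", "n", "oui"),
  values  = c("Yes", "No", "yes", "no", "yes"),
  grp     = c("readmission", "readmission", ".global", ".global", ".global"),
  stringsAsFactors = FALSE
)

# by = "grp" tells match_df() which dictionary column names the target columns
small_clean <- match_df(
  small_dat,
  dictionary = small_grp_dict,
  from = "options",
  to = "values",
  by = "grp"
)

print(small_clean)
```

Keeping column-specific rows alongside .global rows lets one dictionary hold both general spelling fixes and per-variable labels.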

Matching

# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp"
)
head(cleaned)
#>       id       date readmission treated    facility age_group
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+
#>   lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1       unknown          High  Inconclusive      missing  unknown
#> 2  Inconclusive       unknown        Normal          yes      yes
#> 3  Inconclusive        Normal  Inconclusive      missing      yes
#> 4  Inconclusive  Inconclusive       unknown          yes      yes
#> 5        Normal       unknown        Normal      missing       no
#> 6       unknown       unknown  Inconclusive      missing  missing

About

www.repidemicsconsortium.org/matchmaker

Releases

2 releases; latest: matchmaker version 0.1.1, Feb 21, 2020
