cleanepi is an R package designed for cleaning, curating, and standardizing epidemiological data. It streamlines various data cleaning tasks that are typically expected when working with datasets in epidemiology.

Key functionalities of cleanepi include:
- Removing irregularities: It removes duplicated and empty rows and columns, as well as columns with constant values.
- Handling missing values: It replaces missing values with the standard `NA` format, ensuring consistency and ease of analysis.
- Ensuring data integrity: It ensures the uniqueness of uniquely identified columns, thus maintaining data integrity and preventing duplicates.
- Date conversion: It offers functionality to convert character columns to `Date` format under specific conditions, enhancing data uniformity and facilitating temporal analysis. It also offers conversion of numeric values written in letters into numbers.
- Standardizing entries: It can standardize column entries into specified formats, promoting consistency across the dataset.
- Time span calculation: It calculates the time span between two elements of type `Date`, providing valuable demographic insights for epidemiological analysis (see the sketch after this list).
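As a concrete illustration of the time-span item above, the sketch below computes an age in years from a column of birth dates. It assumes cleanepi exports a `timespan()` helper with the argument names shown; the function name and signature are assumptions, not taken from this README.

```r
# A minimal sketch of the time-span calculation, assuming cleanepi exports a
# timespan() helper with these argument names (an assumption, not confirmed
# by this README).
library(cleanepi)

dob <- data.frame(date_of_birth = as.Date(c("1972-01-06", "1952-02-20")))

age <- timespan(
  data             = dob,
  target_column    = "date_of_birth",       # Date column holding the start dates
  end_date         = as.Date("2023-05-29"), # reference date for the calculation
  span_unit        = "years",               # report the span in whole years
  span_column_name = "age_in_years"         # name of the new column
)
```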
cleanepi operates on data frames or similar structures like tibbles, as well as linelist objects commonly used in epidemiological research. It returns the processed data in the same format, ensuring seamless integration into existing workflows. Additionally, it generates a comprehensive report detailing the outcomes of each cleaning task.

cleanepi is developed by the Epiverse-TRACE team at the Medical Research Council The Gambia unit at the London School of Hygiene and Tropical Medicine.
cleanepi can be installed from CRAN using:

```r
install.packages("cleanepi")
```

The latest development version of cleanepi can be installed from GitHub:

```r
if (!require("pak")) install.packages("pak")
pak::pak("epiverse-trace/cleanepi")
library(cleanepi)
```
The main function in cleanepi is `clean_data()`, which internally calls almost all of the standard data cleaning functions, such as removal of empty and duplicated rows and columns, replacement of missing values, etc. However, each function can also be called independently to perform a specific task; this mechanism is explained in detail in the vignette, and a hedged sketch of such piecewise usage is shown directly below. It is followed by a typical example of how to use the `clean_data()` function.
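Because this README does not list the standalone function names, the sketch below assumes they mirror the `clean_data()` arguments used in the example that follows; treat the function names and signatures as assumptions.

```r
# A hedged sketch of the piecewise workflow, assuming the standalone helpers
# mirror the clean_data() arguments used in the example below (the function
# names and signatures here are assumptions, not taken from this README).
library(cleanepi)

# load the same bundled test dataset used in the example below
raw <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

step1 <- replace_missing_values(data = raw, na_strings = "-99")   # recode -99 as NA
step2 <- remove_constants(data = step1)                           # drop constant columns and empty rows
step3 <- remove_duplicates(data = step2)                          # drop duplicated rows
step4 <- standardize_dates(data = step3,
                           target_columns = "date.of.admission")  # character dates -> Date
```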
```r
# READING IN THE TEST DATASET
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)
```
| study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
|---|---|---|---|---|---|---|---|
| PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | 1 |
| PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | 1 |
| PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | -99 |
| PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | 1 |
| P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | 2 |
| PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | -99 | May 02, 2021 | 2 |
| PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | 1 |
| PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | 2 |
| PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | 1 |
| PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | 2 |
```r
# READING IN THE DATA DICTIONARY
test_dictionary <- readRDS(
  system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)
```
| options | values | grp | orders |
|---|---|---|---|
| 1 | male | sex | 1 |
| 2 | female | sex | 2 |
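The dictionary shown above drives the dictionary-based step of `clean_data()`: it maps the coded values 1 and 2 in the `sex` column to "male" and "female". This substitution can also be sketched on its own; the helper name and signature below are assumptions, not taken from this README.

```r
# A minimal sketch of standalone dictionary-based cleaning, assuming a helper
# named clean_using_dictionary() that takes the data and the dictionary above
# (the function name and signature are assumptions, not confirmed by this README).
coded <- cleanepi::clean_using_dictionary(
  data       = test_data,       # 'sex' is coded as 1/2 (with -99 for missing)
  dictionary = test_dictionary  # maps 1 -> "male" and 2 -> "female" in 'sex'
)
# Values not covered by the dictionary (here -99) may need to be replaced
# with NA first, e.g. via the missing-value replacement step.
```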
```r
# SCAN THROUGH THE DATA
scan_res <- cleanepi::scan_data(test_data)
```
```r
# DEFINING THE CLEANING PARAMETERS
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")
remove_duplicates <- list(target_columns = NULL)
standardize_dates <- list(
  target_columns  = NULL,
  error_tolerance = 0.4,
  format          = NULL,
  timeframe       = as.Date(c("1973-05-29", "2023-05-29")),
  orders          = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats         = c("Omdy", "YOmd")
  )
)
standardize_subject_ids <- list(
  target_columns = "study_id",
  prefix         = "PS",
  suffix         = "P2",
  range          = c(1, 100),
  nchar          = 7
)
remove_constants <- list(cutoff = 1)
standardize_column_names <- list(
  keep   = "date.of.admission",
  rename = c(DOB = "dateOfBirth")
)
to_numeric <- list(target_columns = "sex", lang = "en")
```
```r
# PERFORMING THE DATA CLEANING
cleaned_data <- clean_data(
  data                     = test_data,
  standardize_column_names = standardize_column_names,
  remove_constants         = remove_constants,
  replace_missing_values   = replace_missing_values,
  remove_duplicates        = remove_duplicates,
  standardize_dates        = standardize_dates,
  standardize_subject_ids  = standardize_subject_ids,
  to_numeric               = to_numeric,
  dictionary               = test_dictionary,
  check_date_sequence      = NULL
)
#> ℹ Cleaning column names
#> ℹ Replacing missing values with NA
#> ℹ Removing constant columns and empty rows
#> ℹ Removing duplicated rows
#> ℹ No duplicates were found.
#> ℹ Standardizing Date columns
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ Checking subject IDs format
#>
#> ! Detected 0 missing, 0 duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.
#> ℹ Converting the following column into numeric: sex
#>
#> ℹ Performing dictionary-based cleaning
```
| study_id | date.of.admission | DOB | date_first_pcr_positive_test | sex |
|---|---|---|---|---|
| PS001P2 | 2020-12-01 | 06/01/1972 | 2020-12-01 | male |
| PS002P2 | 2021-01-28 | 02/20/1952 | 2021-01-01 | male |
| PS004P2-1 | 2021-02-15 | 06/15/1961 | 2021-02-11 | NA |
| PS003P2 | 2021-02-11 | 11/11/1947 | 2021-02-01 | male |
| P0005P2 | 2021-02-17 | 09/26/2000 | 2021-02-16 | female |
| PS006P2 | 2021-02-17 | NA | 2021-05-02 | female |
| PB500P2 | 2021-02-28 | 11/03/1989 | 2021-02-19 | male |
| PS008P2 | 2021-02-22 | 10/05/1976 | 2021-09-20 | female |
| PS010P2 | 2021-03-02 | 09/23/1991 | 2021-02-26 | male |
| PS011P2 | 2021-03-05 | 02/08/1991 | 2021-03-03 | female |
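The run above flags three subject IDs that do not match the expected format, and the log points to `correct_subject_ids()` for fixing them. A minimal sketch follows; the correction-table column names and the corrected ID values are assumptions for illustration only.

```r
# A minimal sketch of fixing the three subject IDs flagged in the log above,
# assuming correct_subject_ids() takes a two-column correction table with
# 'from' and 'to' columns (column names and corrected values are assumptions).
correction_table <- data.frame(
  from = c("PS004P2-1", "P0005P2", "PB500P2"),  # IDs flagged by the check
  to   = c("PS004P2",   "PS005P2", "PS050P2")   # hypothetical corrected values
)
cleaned_data <- cleanepi::correct_subject_ids(
  data             = cleaned_data,
  correction_table = correction_table
)
```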
```r
# ADD THE DATA SCANNING RESULT TO THE REPORT
cleaned_data <- cleanepi::add_to_report(
  x     = cleaned_data,
  key   = "scanning_result",
  value = scan_res
)
```
```r
# DISPLAY THE DATA CLEANING REPORT
print_report(cleaned_data, print = TRUE)
```
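Besides the formatted report, the individual cleaning logs travel with the cleaned data itself and can be pulled out programmatically. The attribute name used below is an assumption, not confirmed by this README.

```r
# A minimal sketch of programmatic access to the cleaning logs, assuming the
# report is stored as an attribute named "report" on the returned data
# (an assumption, not confirmed by this README).
report <- attr(cleaned_data, "report")
names(report)  # one element per cleaning operation that was logged
```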
The package vignettes can be browsed with:

```r
browseVignettes("cleanepi")
```

This package is currently experimental, as defined by the RECON software lifecycle. This means that it is functional, but interfaces and functionalities may change over time, and testing and documentation may be lacking.
Contributions are welcome via pull requests.

Please note that the cleanepi project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
citation("cleanepi")#> To cite package 'cleanepi' in publications use:#>#> Mané K, Degoot A, Ahadzie B, Mohammed N, Bah B (2025). _cleanepi:#> Clean and Standardize Epidemiological Data_.#> doi:10.5281/zenodo.11473985#> <https://doi.org/10.5281/zenodo.11473985>,#> <https://epiverse-trace.github.io/cleanepi/>.#>#> A BibTeX entry for LaTeX users is#>#> @Manual{,#> title = {cleanepi: Clean and Standardize Epidemiological Data},#> author = {Karim Mané and Abdoelnaser Degoot and Bankolé Ahadzie and Nuredin Mohammed and Bubacarr Bah},#> year = {2025},#> doi = {10.5281/zenodo.11473985},#> url = {https://epiverse-trace.github.io/cleanepi/},#> }