cleanepi is an R package designed for cleaning, curating, and standardizing epidemiological data. It streamlines various data cleaning tasks that are typically expected when working with datasets in epidemiology.

Key functionalities of cleanepi include:
- Removing irregularities: It removes duplicated and empty rows and columns, as well as columns with constant values.
- Handling missing values: It replaces missing values with the standard `NA` format, ensuring consistency and ease of analysis.
- Ensuring data integrity: It ensures the uniqueness of uniquely identified columns, thus maintaining data integrity and preventing duplicates.
- Date conversion: It offers functionality to convert character columns to `Date` format under specific conditions, enhancing data uniformity and facilitating temporal analysis. It also offers conversion of numeric values written in letters into numbers.
- Standardizing entries: It can standardize column entries into specified formats, promoting consistency across the dataset.
- Time span calculation: It calculates the time span between two elements of type `Date`, providing valuable demographic insights for epidemiological analysis (see the sketch after this list).
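As a concrete illustration of the time-span item above, the sketch below computes an age in years from a column of birth dates. It assumes cleanepi exports a `timespan()` helper with the argument names shown; the function name and signature are assumptions, not taken from this README.

```r
# A minimal sketch of the time-span calculation, assuming cleanepi exports a
# timespan() helper with these argument names (an assumption, not confirmed
# by this README).
library(cleanepi)

dob <- data.frame(date_of_birth = as.Date(c("1972-01-06", "1952-02-20")))

age <- timespan(
  data             = dob,
  target_column    = "date_of_birth",       # Date column holding the start dates
  end_date         = as.Date("2023-05-29"), # reference date for the calculation
  span_unit        = "years",               # report the span in whole years
  span_column_name = "age_in_years"         # name of the new column
)
```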
cleanepi operates on data frames or similar structures like tibbles, as well as linelist objects commonly used in epidemiological research. It returns the processed data in the same format, ensuring seamless integration into existing workflows. Additionally, it generates a comprehensive report detailing the outcomes of each cleaning task.

cleanepi is developed by the Epiverse-TRACE team at the Medical Research Council The Gambia unit at the London School of Hygiene and Tropical Medicine.
cleanepi can be installed from CRAN using:

```r
install.packages("cleanepi")
```

The latest development version of cleanepi can be installed from GitHub:

```r
if (!require("pak")) install.packages("pak")
pak::pak("epiverse-trace/cleanepi")
library(cleanepi)
```
The main function in cleanepi is `clean_data()`, which internally calls almost all of the standard data cleaning functions, such as removal of empty and duplicated rows and columns, replacement of missing values, etc. However, each function can also be called independently to perform a specific task; this mechanism is explained in detail in the vignette, and a hedged sketch of such piecewise usage is shown directly below. It is followed by a typical example of how to use the `clean_data()` function.
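Because this README does not list the standalone function names, the sketch below assumes they mirror the `clean_data()` arguments used in the example that follows; treat the function names and signatures as assumptions.

```r
# A hedged sketch of the piecewise workflow, assuming the standalone helpers
# mirror the clean_data() arguments used in the example below (the function
# names and signatures here are assumptions, not taken from this README).
library(cleanepi)

# load the same bundled test dataset used in the example below
raw <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

step1 <- replace_missing_values(data = raw, na_strings = "-99")   # recode -99 as NA
step2 <- remove_constants(data = step1)                           # drop constant columns and empty rows
step3 <- remove_duplicates(data = step2)                          # drop duplicated rows
step4 <- standardize_dates(data = step3,
                           target_columns = "date.of.admission")  # character dates -> Date
```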
```r
# READING IN THE TEST DATASET
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)
```
| study_id | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
|---|---|---|---|---|---|---|---|
| PS001P2 | day 0 | 2 | Gambia | 01/12/2020 | 06/01/1972 | Dec 01, 2020 | 1 |
| PS002P2 | day 0 | 2 | Gambia | 28/01/2021 | 02/20/1952 | Jan 01, 2021 | 1 |
| PS004P2-1 | day 0 | 2 | Gambia | 15/02/2021 | 06/15/1961 | Feb 11, 2021 | -99 |
| PS003P2 | day 0 | 2 | Gambia | 11/02/2021 | 11/11/1947 | Feb 01, 2021 | 1 |
| P0005P2 | day 0 | 2 | Gambia | 17/02/2021 | 09/26/2000 | Feb 16, 2021 | 2 |
| PS006P2 | day 0 | 2 | Gambia | 17/02/2021 | -99 | May 02, 2021 | 2 |
| PB500P2 | day 0 | 2 | Gambia | 28/02/2021 | 11/03/1989 | Feb 19, 2021 | 1 |
| PS008P2 | day 0 | 2 | Gambia | 22/02/2021 | 10/05/1976 | Sep 20, 2021 | 2 |
| PS010P2 | day 0 | 2 | Gambia | 02/03/2021 | 09/23/1991 | Feb 26, 2021 | 1 |
| PS011P2 | day 0 | 2 | Gambia | 05/03/2021 | 02/08/1991 | Mar 03, 2021 | 2 |
```r
# READING IN THE DATA DICTIONARY
test_dictionary <- readRDS(
  system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)
```
| options | values | grp | orders |
|---|---|---|---|
| 1 | male | sex | 1 |
| 2 | female | sex | 2 |
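The dictionary shown above drives the dictionary-based step of `clean_data()`: it maps the coded values 1 and 2 in the `sex` column to "male" and "female". This substitution can also be sketched on its own; the helper name and signature below are assumptions, not taken from this README.

```r
# A minimal sketch of standalone dictionary-based cleaning, assuming a helper
# named clean_using_dictionary() that takes the data and the dictionary above
# (the function name and signature are assumptions, not confirmed by this README).
coded <- cleanepi::clean_using_dictionary(
  data       = test_data,       # 'sex' is coded as 1/2 (with -99 for missing)
  dictionary = test_dictionary  # maps 1 -> "male" and 2 -> "female" in 'sex'
)
# Values not covered by the dictionary (here -99) may need to be replaced
# with NA first, e.g. via the missing-value replacement step.
```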
```r
# SCAN THROUGH THE DATA
scan_res <- cleanepi::scan_data(test_data)
```
```r
# DEFINING THE CLEANING PARAMETERS
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")
remove_duplicates <- list(target_columns = NULL)
standardize_dates <- list(
  target_columns  = NULL,
  error_tolerance = 0.4,
  format          = NULL,
  timeframe       = as.Date(c("1973-05-29", "2023-05-29")),
  orders          = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats         = c("Omdy", "YOmd")
  )
)
standardize_subject_ids <- list(
  target_columns = "study_id",
  prefix         = "PS",
  suffix         = "P2",
  range          = c(1, 100),
  nchar          = 7
)
remove_constants <- list(cutoff = 1)
standardize_column_names <- list(
  keep   = "date.of.admission",
  rename = c(DOB = "dateOfBirth")
)
to_numeric <- list(target_columns = "sex", lang = "en")
```
```r
# PERFORMING THE DATA CLEANING
cleaned_data <- clean_data(
  data                     = test_data,
  standardize_column_names = standardize_column_names,
  remove_constants         = remove_constants,
  replace_missing_values   = replace_missing_values,
  remove_duplicates        = remove_duplicates,
  standardize_dates        = standardize_dates,
  standardize_subject_ids  = standardize_subject_ids,
  to_numeric               = to_numeric,
  dictionary               = test_dictionary,
  check_date_sequence      = NULL
)
#> ℹ Cleaning column names
#> ℹ Replacing missing values with NA
#> ℹ Removing constant columns and empty rows
#> ℹ Removing duplicated rows
#> ℹ No duplicates were found.
#> ℹ Standardizing Date columns
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ Checking subject IDs format
#>
#> ! Detected 0 missing, 0 duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.
#> ℹ Converting the following column into numeric: sex
#>
#> ℹ Performing dictionary-based cleaning
```
| study_id | date.of.admission | DOB | date_first_pcr_positive_test | sex |
|---|---|---|---|---|
| PS001P2 | 2020-12-01 | 06/01/1972 | 2020-12-01 | male |
| PS002P2 | 2021-01-28 | 02/20/1952 | 2021-01-01 | male |
| PS004P2-1 | 2021-02-15 | 06/15/1961 | 2021-02-11 | NA |
| PS003P2 | 2021-02-11 | 11/11/1947 | 2021-02-01 | male |
| P0005P2 | 2021-02-17 | 09/26/2000 | 2021-02-16 | female |
| PS006P2 | 2021-02-17 | NA | 2021-05-02 | female |
| PB500P2 | 2021-02-28 | 11/03/1989 | 2021-02-19 | male |
| PS008P2 | 2021-02-22 | 10/05/1976 | 2021-09-20 | female |
| PS010P2 | 2021-03-02 | 09/23/1991 | 2021-02-26 | male |
| PS011P2 | 2021-03-05 | 02/08/1991 | 2021-03-03 | female |
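The run above flags three subject IDs that do not match the expected format, and the log points to `correct_subject_ids()` for fixing them. A minimal sketch follows; the correction-table column names and the corrected ID values are assumptions for illustration only.

```r
# A minimal sketch of fixing the three subject IDs flagged in the log above,
# assuming correct_subject_ids() takes a two-column correction table with
# 'from' and 'to' columns (column names and corrected values are assumptions).
correction_table <- data.frame(
  from = c("PS004P2-1", "P0005P2", "PB500P2"),  # IDs flagged by the check
  to   = c("PS004P2",   "PS005P2", "PS050P2")   # hypothetical corrected values
)
cleaned_data <- cleanepi::correct_subject_ids(
  data             = cleaned_data,
  correction_table = correction_table
)
```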
```r
# ADD THE DATA SCANNING RESULT TO THE REPORT
cleaned_data <- cleanepi::add_to_report(
  x     = cleaned_data,
  key   = "scanning_result",
  value = scan_res
)
```
```r
# DISPLAY THE DATA CLEANING REPORT
print_report(cleaned_data, print = TRUE)
```
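Besides the formatted report, the individual cleaning logs travel with the cleaned data itself and can be pulled out programmatically. The attribute name used below is an assumption, not confirmed by this README.

```r
# A minimal sketch of programmatic access to the cleaning logs, assuming the
# report is stored as an attribute named "report" on the returned data
# (an assumption, not confirmed by this README).
report <- attr(cleaned_data, "report")
names(report)  # one element per cleaning operation that was logged
```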
The package vignettes can be browsed with:

```r
browseVignettes("cleanepi")
```

This package is currently experimental, as defined by the RECON software lifecycle. This means that it is functional, but interfaces and functionalities may change over time, and testing and documentation may be lacking.
Contributions are welcome via pull requests.

Please note that the cleanepi project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
citation("cleanepi")#> To cite package 'cleanepi' in publications use:#>#> Mané K, Degoot A, Ahadzie B, Mohammed N, Bah B (2025). _cleanepi:#> Clean and Standardize Epidemiological Data_.#> doi:10.5281/zenodo.11473985#> <https://doi.org/10.5281/zenodo.11473985>,#> <https://epiverse-trace.github.io/cleanepi/>.#>#> A BibTeX entry for LaTeX users is#>#> @Manual{,#> title = {cleanepi: Clean and Standardize Epidemiological Data},#> author = {Karim Mané and Abdoelnaser Degoot and Bankolé Ahadzie and Nuredin Mohammed and Bubacarr Bah},#> year = {2025},#> doi = {10.5281/zenodo.11473985},#> url = {https://epiverse-trace.github.io/cleanepi/},#> }