R package to clean and standardize epidemiological data

epiverse-trace/cleanepi

License: MIT | R-CMD-check | Codecov test coverage | lifecycle: experimental | DOI

cleanepi is an R package designed for cleaning, curating, and standardizing epidemiological data. It streamlines various data cleaning tasks that are typically expected when working with datasets in epidemiology.

Key functionalities of cleanepi include:

  1. Removing irregularities: It removes duplicated and empty rows and columns, as well as columns with constant values.

  2. Handling missing values: It replaces missing values with the standard NA format, ensuring consistency and ease of analysis.

  3. Ensuring data integrity: It ensures that values in columns meant to uniquely identify records are unique, thus maintaining data integrity and preventing duplicates.

  4. Date conversion: It offers functionality to convert character columns to Date format under specific conditions, enhancing data uniformity and facilitating temporal analysis. It also offers conversion of numeric values written in letters into numbers.

  5. Standardizing entries: It can standardize column entries into specified formats, promoting consistency across the dataset.

  6. Time span calculation: It calculates the time span between two elements of type Date, providing valuable demographic insights for epidemiological analysis.
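
To make the first two items concrete, here is a minimal base R sketch of what removing irregularities and handling missing values amount to on a toy data frame. It is not cleanepi's implementation: the data frame `toy` and the helper `is_constant()` are made up for illustration, and "-99" is assumed to be the code used for missing values.

```r
# Toy line list (hypothetical data, for illustration only)
toy <- data.frame(
  id      = c("A1", "A2", "A2", "A3"),
  country = c("Gambia", "Gambia", "Gambia", "Gambia"),
  sex     = c("1", "-99", "-99", "2"),
  stringsAsFactors = FALSE
)

# Handling missing values: recode the assumed missing-value string "-99" as NA
toy[toy == "-99"] <- NA

# Removing irregularities: drop duplicated rows
toy <- toy[!duplicated(toy), , drop = FALSE]

# Hypothetical helper: TRUE when a column holds at most one distinct value
is_constant <- function(x) length(unique(x)) <= 1

# Removing irregularities: drop constant columns
toy <- toy[, !vapply(toy, is_constant, logical(1)), drop = FALSE]
```

In this sketch the duplicated "A2" row and the constant `country` column are dropped. cleanepi performs comparable steps and additionally records what was changed in its report.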

cleanepi operates on data frames or similar structures like tibbles, as well as linelist objects commonly used in epidemiological research. It returns the processed data in the same format, ensuring seamless integration into existing workflows. Additionally, it generates a comprehensive report detailing the outcomes of each cleaning task.

cleanepi is developed by the Epiverse-TRACE team at the Medical Research Council The Gambia unit at the London School of Hygiene and Tropical Medicine.

Installation

cleanepi can be installed from CRAN using

install.packages("cleanepi")

The latest development version of cleanepi can be installed from GitHub.

if (!require("pak")) install.packages("pak")
pak::pak("epiverse-trace/cleanepi")
library(cleanepi)

Quick start

The main function in cleanepi is clean_data(), which internally calls almost all standard data cleaning functions, such as removal of empty and duplicated rows and columns, replacement of missing values, etc. However, each function can also be called independently to perform a specific task. This mechanism is explained in detail in the vignette. Below is a typical example of how to use the clean_data() function.

# READING IN THE TEST DATASET
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)

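| study_id  | event_name | country_code | country_name | date.of.admission | dateOfBirth | date_first_pcr_positive_test | sex |
|-----------|------------|--------------|--------------|-------------------|-------------|------------------------------|-----|
| PS001P2   | day 0      | 2            | Gambia       | 01/12/2020        | 06/01/1972  | Dec 01, 2020                 | 1   |
| PS002P2   | day 0      | 2            | Gambia       | 28/01/2021        | 02/20/1952  | Jan 01, 2021                 | 1   |
| PS004P2-1 | day 0      | 2            | Gambia       | 15/02/2021        | 06/15/1961  | Feb 11, 2021                 | -99 |
| PS003P2   | day 0      | 2            | Gambia       | 11/02/2021        | 11/11/1947  | Feb 01, 2021                 | 1   |
| P0005P2   | day 0      | 2            | Gambia       | 17/02/2021        | 09/26/2000  | Feb 16, 2021                 | 2   |
| PS006P2   | day 0      | 2            | Gambia       | 17/02/2021        | -99         | May 02, 2021                 | 2   |
| PB500P2   | day 0      | 2            | Gambia       | 28/02/2021        | 11/03/1989  | Feb 19, 2021                 | 1   |
| PS008P2   | day 0      | 2            | Gambia       | 22/02/2021        | 10/05/1976  | Sep 20, 2021                 | 2   |
| PS010P2   | day 0      | 2            | Gambia       | 02/03/2021        | 09/23/1991  | Feb 26, 2021                 | 1   |
| PS011P2   | day 0      | 2            | Gambia       | 05/03/2021        | 02/08/1991  | Mar 03, 2021                 | 2   |
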
# READING IN THE DATA DICTIONARY
test_dictionary <- readRDS(
  system.file("extdata", "test_dictionary.RDS", package = "cleanepi")
)

| options | values | grp | orders |
|---------|--------|-----|--------|
| 1       | male   | sex | 1      |
| 2       | female | sex | 2      |

# SCAN THROUGH THE DATA
scan_res <- cleanepi::scan_data(test_data)

# DEFINING THE CLEANING PARAMETERS
replace_missing_values <- list(target_columns = NULL, na_strings = "-99")
remove_duplicates <- list(target_columns = NULL)
standardize_dates <- list(
  target_columns = NULL,
  error_tolerance = 0.4,
  format = NULL,
  timeframe = as.Date(c("1973-05-29", "2023-05-29")),
  orders = list(
    world_named_months = c("Ybd", "dby"),
    world_digit_months = c("dmy", "Ymd"),
    US_formats = c("Omdy", "YOmd")
  )
)
standardize_subject_ids <- list(
  target_columns = "study_id",
  prefix = "PS",
  suffix = "P2",
  range = c(1, 100),
  nchar = 7
)
remove_constants <- list(cutoff = 1)
standardize_column_names <- list(
  keep = "date.of.admission",
  rename = c(DOB = "dateOfBirth")
)
to_numeric <- list(target_columns = "sex", lang = "en")

# PERFORMING THE DATA CLEANING
cleaned_data <- clean_data(
  data = test_data,
  standardize_column_names = standardize_column_names,
  remove_constants = remove_constants,
  replace_missing_values = replace_missing_values,
  remove_duplicates = remove_duplicates,
  standardize_dates = standardize_dates,
  standardize_subject_ids = standardize_subject_ids,
  to_numeric = to_numeric,
  dictionary = test_dictionary,
  check_date_sequence = NULL
)
#> ℹ Cleaning column names
#> ℹ Replacing missing values with NA
#> ℹ Removing constant columns and empty rows
#> ℹ Removing duplicated rows
#> ℹ No duplicates were found.
#> ℹ Standardizing Date columns
#> ! Detected 8 values that comply with multiple formats and no values that are
#>   outside of the specified time frame.
#> ℹ Enter `print_report(data = dat, "date_standardization")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ Checking subject IDs format
#>
#> ! Detected 0 missing, 0 duplicated, and 3 incorrect subject IDs.
#> ℹ Enter `print_report(data = dat, "incorrect_subject_id")` to access them,
#>   where "dat" is the object used to store the output from this operation.
#> ℹ You can use the `correct_subject_ids()` function to correct them.
#> ℹ Converting the following  column into numeric: sex
#>
#> ℹ Performing dictionary-based cleaning

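| study_id  | date.of.admission | DOB        | date_first_pcr_positive_test | sex    |
|-----------|-------------------|------------|------------------------------|--------|
| PS001P2   | 2020-12-01        | 06/01/1972 | 2020-12-01                   | male   |
| PS002P2   | 2021-01-28        | 02/20/1952 | 2021-01-01                   | male   |
| PS004P2-1 | 2021-02-15        | 06/15/1961 | 2021-02-11                   | NA     |
| PS003P2   | 2021-02-11        | 11/11/1947 | 2021-02-01                   | male   |
| P0005P2   | 2021-02-17        | 09/26/2000 | 2021-02-16                   | female |
| PS006P2   | 2021-02-17        | NA         | 2021-05-02                   | female |
| PB500P2   | 2021-02-28        | 11/03/1989 | 2021-02-19                   | male   |
| PS008P2   | 2021-02-22        | 10/05/1976 | 2021-09-20                   | female |
| PS010P2   | 2021-03-02        | 09/23/1991 | 2021-02-26                   | male   |
| PS011P2   | 2021-03-05        | 02/08/1991 | 2021-03-03                   | female |
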
# ADD THE DATA SCANNING RESULT TO THE REPORT
cleaned_data <- cleanepi::add_to_report(
  x = cleaned_data,
  key = "scanning_result",
  value = scan_res
)

# DISPLAY THE DATA CLEANING REPORT
print_report(cleaned_data, print = TRUE)
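
Each cleaning step can also be run on its own instead of through clean_data(). The following is a minimal sketch of that usage. It assumes the standalone functions replace_missing_values(), remove_constants() and remove_duplicates() take a `data` argument plus the same arguments as the corresponding parameter lists passed to clean_data(); check the package reference for the exact signatures before relying on it.

```r
library(cleanepi)

# Same test dataset as in the example above
test_data <- readRDS(
  system.file("extdata", "test_df.RDS", package = "cleanepi")
)

# Step 1: replace the "-99" code with NA in all columns
step_1 <- replace_missing_values(
  data = test_data,
  target_columns = NULL,
  na_strings = "-99"
)

# Step 2: drop constant columns, and empty rows and columns
step_2 <- remove_constants(data = step_1, cutoff = 1)

# Step 3: drop duplicated rows, comparing across all columns
step_3 <- remove_duplicates(data = step_2, target_columns = NULL)
```

Running the steps one at a time makes it easier to inspect the intermediate data frames, at the cost of having to choose the order of operations yourself.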

Vignette

browseVignettes("cleanepi")

Lifecycle

This package is currently experimental, as defined by the RECON software lifecycle. This means that it is functional, but interfaces and functionalities may change over time, and testing and documentation may be lacking.

Contributions

Contributions are welcome via pull requests.

Code of Conduct

Please note that the cleanepi project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Citing this package

citation("cleanepi")
#> To cite package 'cleanepi' in publications use:
#>
#>   Mané K, Degoot A, Ahadzie B, Mohammed N, Bah B (2025). _cleanepi:
#>   Clean and Standardize Epidemiological Data_.
#>   doi:10.5281/zenodo.11473985
#>   <https://doi.org/10.5281/zenodo.11473985>,
#>   <https://epiverse-trace.github.io/cleanepi/>.
#>
#> A BibTeX entry for LaTeX users is
#>
#>   @Manual{,
#>     title = {cleanepi: Clean and Standardize Epidemiological Data},
#>     author = {Karim Mané and Abdoelnaser Degoot and Bankolé Ahadzie and Nuredin Mohammed and Bubacarr Bah},
#>     year = {2025},
#>     doi = {10.5281/zenodo.11473985},
#>     url = {https://epiverse-trace.github.io/cleanepi/},
#>   }
