Minimizing time spent by data analysts on the cleaning and processing of data by providing a user friendly system for processing source data into a clean format.

License

MIT license


datastan: An R Package to Automate Data Standardization for Efficient Record Linkage

Introduction

datastan provides an easy, user-friendly way for data analysts to process data files and manage the standardization rules that process the source fields of a dataset. The package offers both programmatic and graphical ways to manage and standardize source files.

The package is most useful in data science, specifically data linkage and data analysis. Rather than writing standardization rules from scratch for each dataset, analysts can use the package's standardization rules to normalize disparate source files to a common format, ready for record linkage and analysis.

Installation

R Studio Installation

To install datastan from GitHub, begin by installing and loading the `devtools` package:

# install.packages("devtools")
library(devtools)

Afterwards, you may install the datastan package using `install_github()`:

devtools::install_github("CHIMB/datastan")

Local Installation

To install datastan locally from GitHub, select the most recent release from the right-hand tab on the GitHub repository page. Download the Source code (zip) file, then move over to RStudio. You may then run the code:

path_to_pkg <- file.choose() # Select the unmodified package you downloaded from GitHub.
devtools::install_local(path_to_pkg)

Usage

Before data can be standardized, you must enter the standardizing rules for the desired columns using the Metadata User Interface application, which must be supplied with an `.sqlite` file. To create an empty file for modification, run the `create_new_metadata()` function from the datastan package:

library(datastan)
output_folder <- choose.dir()
create_new_metadata("my_standardizing_rules", output_folder)
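In base R, `choose.dir()` is only available on Windows, so scripted or non-Windows sessions may prefer to set the output folder explicitly. The following is a minimal sketch under stated assumptions: the folder name `datastan_metadata` and its location under `tempdir()` are hypothetical, and the expectation that the call produces `my_standardizing_rules.sqlite` in that folder is inferred from the file name used later in this README rather than verified.

```r
# Hypothetical output location; replace with a folder of your choice.
output_folder <- file.path(tempdir(), "datastan_metadata")
dir.create(output_folder, recursive = TRUE, showWarnings = FALSE)

# Only call into datastan when it is installed, so the script fails softly.
if (requireNamespace("datastan", quietly = TRUE)) {
  datastan::create_new_metadata("my_standardizing_rules", output_folder)
  # Assumption: the new metadata file is written into output_folder.
  print(list.files(output_folder))
} else {
  message("datastan is not installed; see the Installation section.")
}
```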

You may then add your desired source file information and choose the desired standardizing rules in the Metadata application, which can be run using the function `startMetadataUI()` while supplying an existing metadata file:

my_metadata <- file.choose() # Choose the "my_standardizing_rules.sqlite" file.
startMetadataUI(my_metadata)
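Because the metadata is stored in an `.sqlite` file, it can also be inspected outside the application. The sketch below only lists the table names it finds. The helper `list_metadata_tables()` is hypothetical and not part of datastan, it assumes the `DBI` and `RSQLite` packages are installed, and the table layout itself is not described in this README.

```r
# Hypothetical helper (not part of datastan): list the tables in a metadata file.
list_metadata_tables <- function(metadata_file) {
  if (!file.exists(metadata_file)) {
    stop("Metadata file not found: ", metadata_file)
  }
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("RSQLite", quietly = TRUE)) {
    stop("Install the DBI and RSQLite packages to inspect the metadata file.")
  }
  con <- DBI::dbConnect(RSQLite::SQLite(), metadata_file)
  # Always release the connection, even if listing the tables fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbListTables(con)
}

# Example call with a hypothetical path:
# list_metadata_tables("C:/rules/my_standardizing_rules.sqlite")
```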

Once the standardizing rules are in place, you may standardize the data in one of two ways:

OPTION 1: To standardize data programmatically, call the `standardize_data()` function.

For example, if you added a new dataset with standardizing rules in the metadata user interface under the dataset code "mydataset", you may call `standardize_data()` as follows:

input_file <- file.choose() # Choose the input file that contains the source dataset.
dataset_code <- "mydataset" # This is the dataset_code we chose in the metadata application.
standardizing_options <- create_standardizing_options_lookup(chunk_size = 10000, convert_name_case = "upper", read_mode = "cmd", file_output = "csv")
output_folder <- choose.dir() # Choose the output folder where the cleaned data file will be output to.
standardizing_rules <- file.choose() # Choose the .sqlite file containing the rules you just added.
df <- standardize_data(input_file, dataset_code, standardizing_options, output_folder, standardizing_rules)
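For batch or scheduled runs, the interactive file pickers in the call above can be swapped for explicit paths. The following is a rough sketch rather than part of the package: `run_standardization()` is a hypothetical wrapper, the paths in the usage comment are placeholders, and the argument order and option values are copied from the example above.

```r
# Hypothetical wrapper (not part of datastan): same call as above, with
# explicit paths and early checks instead of interactive dialogs.
run_standardization <- function(input_file, dataset_code, rules_file, output_folder) {
  # Fail early with clear messages before any processing starts.
  if (!file.exists(input_file)) stop("Input file not found: ", input_file)
  if (!file.exists(rules_file)) stop("Metadata file not found: ", rules_file)
  if (!dir.exists(output_folder)) dir.create(output_folder, recursive = TRUE)
  if (!requireNamespace("datastan", quietly = TRUE)) {
    stop("The datastan package is not installed.")
  }

  # Option values copied from the README example.
  options_lookup <- datastan::create_standardizing_options_lookup(
    chunk_size = 10000, convert_name_case = "upper",
    read_mode = "cmd", file_output = "csv"
  )
  datastan::standardize_data(input_file, dataset_code, options_lookup,
                             output_folder, rules_file)
}

# Example call with placeholder paths:
# df <- run_standardization("data/source.csv", "mydataset",
#                           "rules/my_standardizing_rules.sqlite", "output")
```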

OPTION 2: To standardize the data using a Graphical User Interface, run `startDataStandardizationUI()` and follow the instructions in the application:
```
startDataStandardizationUI()
```

Missing Value Imputation

One feature of the datastan package is the ability to impute missing values in fields such as postal codes and sex. One option for sex imputation is to supply a custom `.csv` file as the imputation source. If you supply such a file when standardizing data, it must follow this schema:

The file should have at least two columns: the first names of individuals under the column name `primary_given_name`, and their corresponding sex values under the column name `sex`. Each name should appear only once, mapped to its majority sex value, for proper imputation when standardizing data. The outline of the CSV should look as follows:

primary_given_name	sex
John	M
Jane	F
Jim	M
Sam	M
Chris	M
Mabel	F

Custom CSV Imputation File
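The schema above can be produced and checked with base R before the file is handed to datastan. This is a minimal sketch under stated assumptions: the `raw` data, the file name `custom_sex_imputation.csv`, and the helpers `majority_sex()` and `validate_imputation_file()` are all hypothetical and not part of the package. It collapses repeated names to their most frequent sex value, writes the CSV, and verifies the two required columns and the one-row-per-name rule.

```r
# Hypothetical raw observations: a name may appear several times.
raw <- data.frame(
  primary_given_name = c("John", "Jane", "Sam", "Sam", "Sam", "Chris"),
  sex                = c("M",    "F",    "M",   "F",   "M",   "M"),
  stringsAsFactors   = FALSE
)

# Keep the most frequent sex value per name. On a tie, which.max() picks
# the first entry of table(), which is sorted alphabetically.
majority_sex <- function(values) names(which.max(table(values)))
lookup <- aggregate(sex ~ primary_given_name, data = raw, FUN = majority_sex)

# Check a CSV against the schema: required columns, one row per name.
validate_imputation_file <- function(path) {
  # Read everything as character so a column of only "F" is not parsed as FALSE.
  df <- read.csv(path, colClasses = "character")
  missing_cols <- setdiff(c("primary_given_name", "sex"), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(df$primary_given_name) > 0) {
    stop("Each name must appear only once.")
  }
  invisible(df)
}

csv_path <- file.path(tempdir(), "custom_sex_imputation.csv")
write.csv(lookup, csv_path, row.names = FALSE)
checked <- validate_imputation_file(csv_path)
print(checked)
```

Reading the columns as character is a deliberate choice: with default type conversion, a `sex` column containing only `F` values would be read by `read.csv()` as logical `FALSE`.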

Additional Information & Documentation

For more details on how the code processes and cleans data, reads from the metadata, and how the user interfaces work on the back end, consider reading the Developer Facing Documentation (448.3KB).

For more details on how to make function calls, navigate the pages of the user interfaces, and change or add information in the metadata, consider reading the User Facing Documentation (1.5MB).

Authors

Releases

Latest release: datastan - May 28th, 2025 (35 releases in total)
