Minimizing time spent by data analysts on the cleaning and processing of data by providing a user friendly system for processing source data into a clean format.
datastan provides an easy and user-friendly way for data analysts to process data files and manage the standardization rules responsible for processing the source fields of a dataset. This package provides both programmatic and graphical ways to manage and standardize source files.
This package is most useful in data science, specifically data linkage and data analysis. Instead of writing standardization rules for each dataset, you can use the package's standardization rules to normalize disparate source files to a common format, ready for record linkage and analysis.
To install datastan from GitHub, begin by installing and loading the devtools package:
# install.packages("devtools")
library(devtools)
Afterwards, you may install the datastan package using install_github():
devtools::install_github("CHIMB/datastan")
To install datastan locally from GitHub, select the most recent release from the right-hand tab on the GitHub repository page. Download the Source code (zip) file, then move over to RStudio. You may then run the code:
path_to_pkg <- file.choose() # Select the unmodified package you downloaded from GitHub.
devtools::install_local(path_to_pkg)
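After either installation route, one way to confirm that the package can be loaded is the base R check sketched below. It is a suggestion rather than a step from the package documentation; it only relies on requireNamespace() and utils::packageVersion().

```r
# TRUE when datastan is installed and loadable, FALSE otherwise.
datastan_available <- requireNamespace("datastan", quietly = TRUE)

if (datastan_available) {
  message("datastan version: ", as.character(utils::packageVersion("datastan")))
} else {
  message("datastan not found; re-run one of the installation steps above.")
}
```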
Before standardizing data can take place, you must enter the standardizing rules for the desired columns using the Metadata User Interface application, which must be supplied with an .sqlite file. To create an empty file for modification, you may run the create_new_metadata() function from the datastan package:
output_folder <- choose.dir()
create_new_metadata("my_standardizing_rules", output_folder)
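Note that choose.dir() is only available on Windows and needs an interactive session. For scripts or other platforms, a minimal sketch of preparing the output folder from a plain path is shown below. The helper prepare_output_folder() is hypothetical (not part of datastan), and it assumes create_new_metadata() accepts any existing folder path, as the example above suggests.

```r
# Hypothetical helper, not part of datastan: make sure a folder exists
# and return its normalized path.
prepare_output_folder <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path, winslash = "/", mustWork = TRUE)
}

# A temporary folder is used here purely for illustration.
output_folder <- prepare_output_folder(file.path(tempdir(), "datastan_metadata"))

# With datastan installed, the call from above would then be:
# create_new_metadata("my_standardizing_rules", output_folder)
```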
You may then add your desired source file information and choose the desired standardizing rules in the Metadata application, which can be run using the function startMetadataUI() while supplying an existing metadata file:
my_metadata <- file.choose() # Choose the "my_standardizing_rules.sqlite" file.
startMetadataUI(my_metadata)
Once the standardizing rules are in place, you may standardize the data in one of two ways:
OPTION 1: To standardize data programmatically, you may call the standardize_data() function. For example, suppose you added a new dataset called "mydataset" with standardizing rules in the metadata user interface. You may then call the standardize_data() function the following way:
input_file <- file.choose() # Choose the input file that contains the source dataset.
dataset_code <- "mydataset" # This is the dataset_code we chose in the metadata application.
standardizing_options <- create_standardizing_options_lookup(chunk_size = 10000, convert_name_case = "upper", read_mode = "cmd", file_output = "csv")
output_folder <- choose.dir() # Choose the output folder where the cleaned data file will be output to.
standardizing_rules <- file.choose() # Choose the .sqlite file containing the rules you just added.
df <- standardize_data(input_file, dataset_code, standardizing_options, output_folder, standardizing_rules)
OPTION 2: To standardize the data using a Graphical User Interface, you may run the startDataStandardizationUI() function and follow along with the instructions in the application:
startDataStandardizationUI()
One of the features of the datastan package is the ability to try to impute missing values in fields like postal codes and sex. One of the options within the sex imputation is to accept a custom .csv file to use for imputation. If you would like to supply such a file when standardizing data, the file must follow this schema:
There should be at least two columns, with the first names of the individuals under the column name primary_given_name and their corresponding sex values under the column name sex. Each name should appear only once and map to its majority sex value for proper imputation when standardizing data. The outline of the csv should look as follows:
primary_given_name | sex |
---|---|
John | M |
Jane | F |
Jim | M |
Sam | M |
Chris | M |
Mabel | F |
Custom CSV Imputation File
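Before supplying a custom file, it can help to check it against the schema above. The following is a minimal sketch in base R. The helper validate_imputation_csv() is hypothetical (not part of datastan), and comparing names case-insensitively is an assumption rather than documented package behaviour.

```r
# Hypothetical helper, not part of datastan: returns a character vector of
# schema problems, or an empty vector when none are found.
validate_imputation_csv <- function(path) {
  # Read everything as text so that a sex value of "F" is never parsed as FALSE.
  df <- read.csv(path, colClasses = "character", na.strings = "", strip.white = TRUE)
  problems <- character(0)

  required <- c("primary_given_name", "sex")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    return(paste("missing column(s):", paste(missing_cols, collapse = ", ")))
  }

  # Names should appear only once (compared case-insensitively, an assumption).
  dup <- unique(df$primary_given_name[duplicated(toupper(df$primary_given_name))])
  if (length(dup) > 0) {
    problems <- c(problems, paste("duplicated name(s):", paste(dup, collapse = ", ")))
  }

  # Every name needs a sex value to map to.
  empty <- is.na(df$sex) | df$sex == ""
  if (any(empty)) {
    problems <- c(problems, paste("rows with no sex value:", sum(empty)))
  }
  problems
}

# Illustration with a small temporary file that follows the schema.
example_path <- tempfile(fileext = ".csv")
writeLines(c("primary_given_name,sex", "John,M", "Jane,F", "Jim,M"), example_path)
issues <- validate_imputation_csv(example_path)
if (length(issues) == 0) message("File matches the expected schema.") else print(issues)
```

The function reports problems instead of stopping, so all issues in a file can be reviewed in a single pass.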
For more details on how the code processes and cleans data, how it reads from the metadata, and how the user interfaces work on the back end, consider reading the Developer Facing Documentation (448.3KB).
For more details on how to use the function calls, how to navigate the pages of the user interfaces, and how to change or add information in the metadata, consider reading the User Facing Documentation (1.5MB).