Minimizing time spent by data analysts on the cleaning and processing of data by providing a user friendly system for processing source data into a clean format.


CHIMB/datastan


Introduction

datastan provides an easy, user-friendly way for data analysts to process data files and manage the standardization rules responsible for processing the source fields of a dataset. This package provides both programmatic and graphical ways to manage and standardize source files.

This package is most beneficial in data science, specifically data linkage and data analysis. Instead of writing standardization rules for each dataset by hand, analysts can use the standardization rules of the datastan package to normalize disparate source files into a common format, ready for record linkage and analysis.

Installation

RStudio Installation

To install datastan from GitHub, begin by installing and loading the devtools package:

# install.packages("devtools")
library(devtools)

Afterwards, you may install the datastan package using install_github():

devtools::install_github("CHIMB/datastan")

Local Installation

To install datastan locally from GitHub, select the most recent release from the right-hand tab on the GitHub repository page. Download the Source code (zip) file, then switch to RStudio and run the following code:

path_to_pkg <- file.choose() # Select the unmodified package you downloaded from GitHub.
devtools::install_local(path_to_pkg)

Usage

Before data can be standardized, you must enter the standardizing rules for the desired columns using the Metadata User Interface application, which must be supplied with an .sqlite file. To create an empty file for modification, you may run the create_new_metadata() function from the datastan package:

output_folder <- choose.dir()
create_new_metadata("my_standardizing_rules", output_folder)
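Note that choose.dir() is only available in R on Windows, so in scripts or on other platforms the output folder can be given as a plain path instead. The following is a minimal sketch, not the package's documented workflow: the folder name datastan_metadata is an arbitrary choice for illustration, and the listing at the end assumes that create_new_metadata() writes a .sqlite file into the given folder.

```r
# Build an output folder path without an interactive dialog.
# "datastan_metadata" is a hypothetical folder name chosen for this sketch.
output_folder <- file.path(tempdir(), "datastan_metadata")
dir.create(output_folder, recursive = TRUE, showWarnings = FALSE)

# Only call into datastan when it is installed, so the sketch runs either way.
if (requireNamespace("datastan", quietly = TRUE)) {
  datastan::create_new_metadata("my_standardizing_rules", output_folder)
  # Assumption: the new metadata file is created inside output_folder.
  print(list.files(output_folder, pattern = "\\.sqlite$"))
}
```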

You may then add your desired source file information and choose the desired standardizing rules in the Metadata application, which can be run using the function startMetadataUI() while supplying an existing metadata file:

my_metadata <- file.choose() # Choose the "my_standardizing_rules.sqlite" file.
startMetadataUI(my_metadata)

Once the standardizing rules are in place, you may standardize the data in one of two ways:

  • OPTION 1: To standardize data programmatically, call the standardize_data() function. Consider the following example:

    • Suppose you added a new dataset with standardizing rules in the metadata user interface under the code "mydataset". You may then call the standardize_data() function as follows:
    input_file <- file.choose() # Choose the input file that contains the source dataset.
    dataset_code <- "mydataset" # This is the dataset_code we chose in the metadata application.
    standardizing_options <- create_standardizing_options_lookup(chunk_size = 10000, convert_name_case = "upper", read_mode = "cmd", file_output = "csv")
    output_folder <- choose.dir() # Choose the output folder where the cleaned data file will be written.
    standardizing_rules <- file.choose() # Choose the .sqlite file containing the rules you just added.
    df <- standardize_data(input_file, dataset_code, standardizing_options, output_folder, standardizing_rules)
  • OPTION 2: To standardize the data using a Graphical User Interface, you may run startDataStandardizationUI() and follow the instructions in the application:

    startDataStandardizationUI()
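For batch or scheduled runs, the OPTION 1 call can be made without file dialogs by passing explicit paths. The sketch below is an illustration under stated assumptions rather than part of the package: check_standardize_inputs() is a hypothetical helper, and the three paths are placeholders to replace with your own. The datastan calls use only the arguments shown in the example above and run only if the package is installed and all paths exist.

```r
# Hypothetical helper (not part of datastan): collect problems with the
# paths that will be passed to standardize_data() before calling it.
check_standardize_inputs <- function(input_file, output_folder, standardizing_rules) {
  problems <- character(0)
  if (!file.exists(input_file)) {
    problems <- c(problems, paste("input file not found:", input_file))
  }
  if (!dir.exists(output_folder)) {
    problems <- c(problems, paste("output folder not found:", output_folder))
  }
  if (!file.exists(standardizing_rules) ||
      !grepl("\\.sqlite$", standardizing_rules, ignore.case = TRUE)) {
    problems <- c(problems, paste("rules file missing or not .sqlite:", standardizing_rules))
  }
  problems
}

# Placeholder paths for illustration only.
input_file <- "data/mydataset.csv"
output_folder <- "output"
standardizing_rules <- "metadata/my_standardizing_rules.sqlite"

problems <- check_standardize_inputs(input_file, output_folder, standardizing_rules)
if (length(problems) > 0) {
  message(paste(problems, collapse = "\n"))
} else if (requireNamespace("datastan", quietly = TRUE)) {
  standardizing_options <- datastan::create_standardizing_options_lookup(
    chunk_size = 10000, convert_name_case = "upper",
    read_mode = "cmd", file_output = "csv"
  )
  df <- datastan::standardize_data(
    input_file, "mydataset", standardizing_options,
    output_folder, standardizing_rules
  )
}
```

Checking the paths first gives one readable message listing every missing input, instead of a failure partway through processing.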

Missing Value Imputation

One of the features of the datastan package is the ability to impute missing values in fields such as postal codes and sex. The sex imputation can optionally use a custom .csv file. If you would like to supply such a file when standardizing data, it must follow this schema:

The file should have at least two columns: the first names of the individuals under the column name primary_given_name, and their corresponding sex values under the column name sex. Each name should appear only once and map to its majority sex value so that imputation works properly when standardizing data. The outline of the CSV should look as follows:

primary_given_name    sex
John                  M
Jane                  F
Jim                   M
Sam                   M
Chris                 M
Mabel                 F

Custom CSV Imputation File
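Before passing a custom imputation file to the package, it can help to check it against the schema described above. The following base R sketch is an assumption-laden illustration, not datastan's own validation: validate_imputation_csv() is a hypothetical helper, the file name sex_imputation.csv is arbitrary, and comparing names case-insensitively is a cautious choice made here, not documented package behaviour.

```r
# Hypothetical helper (not part of datastan): check a custom sex-imputation
# file for the required columns and for names listed more than once.
validate_imputation_csv <- function(path) {
  # Read everything as character so a column of only "F"/"T" is not parsed as logical.
  df <- read.csv(path, colClasses = "character")
  required <- c("primary_given_name", "sex")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: treat "John" and "JOHN" as the same name when looking for repeats.
  is_dupe <- duplicated(toupper(df$primary_given_name))
  if (any(is_dupe)) {
    stop("name(s) listed more than once: ",
         paste(unique(df$primary_given_name[is_dupe]), collapse = ", "))
  }
  invisible(df)
}

# Write the example table to a temporary file and validate it.
example <- data.frame(
  primary_given_name = c("John", "Jane", "Jim", "Sam", "Chris", "Mabel"),
  sex = c("M", "F", "M", "M", "M", "F")
)
csv_path <- file.path(tempdir(), "sex_imputation.csv")
write.csv(example, csv_path, row.names = FALSE)
checked <- validate_imputation_csv(csv_path)
```

If a name maps to more than one sex value in your source data, reduce it to the majority value before writing the file, since each name should appear only once.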

Additional Information & Documentation

For more details on how the code processes and cleans data, how it reads from the metadata, and how the user interfaces work on the back end, consider reading the Developer Facing Documentation (448.3KB).

For more details on how to make function calls, how to navigate the pages of the user interfaces, and how to change or add information in the metadata, consider reading the User Facing Documentation (1.5MB).

Authors
