etl is an R package to facilitateExtract -Transform - Load (ETL) operations formedium data.The end result is generally a populated SQL database, but the userinteraction takes place solely within R.
etlInstantiate anetl object using a string that determinesthe class of the resulting object, and the package that provides accessto that data. The trivialmtcars database is built intoetl.
## No database was specified so I created one for you at:## /tmp/Rtmp00cuuZ/file276e72f83f738.sqlite3## [1] "etl_mtcars" "etl" "src_SQLiteConnection"## [4] "src_dbi" "src_sql" "src"Pay careful attention to where the SQLite database is stored. Thedefault location is a temporary directory, but you will want to movethis to a more secure location if you want this storage to bepersistent. Seefile.copy() for examples on how to move afile.
etl works with a local or remote database to store yourdata. Everyetl object extends adplyr::src_dbi object. If, as in the example above, you donot specify a SQL source, a localRSQLite database will becreated for you. However, you can also specify any source that inheritsfromdplyr::src_dbi.
Note: If you want to use a database other than a local RSQLite, youmust create the
mtcarsdatabase and have permission towrite to it first!
# For PostgreSQLlibrary(RPostgreSQL)db<-src_postgres(dbname ="mtcars",user ="postgres",host ="localhost")# Alternatively, for MySQLlibrary(RMariaDB)db<-src_mysql(dbname ="mtcars",user ="r-user",password ="mypass",host ="localhost")cars<-etl("mtcars", db)At the heart ofetl are three functions:etl_extract(),etl_transform(), andetl_load().
The first step is to acquire data from an online source.
## Extracting raw data...This creates a local store of raw data.
These data may need to be transformed from their raw form to filessuitable for importing into SQL (usually CSVs).
Populate the SQL database with the transformed data.
## Loading 12 file(s) into the database...To populate the whole database from scratch, useetl_create.
You can also update an existing database without re-initializing, butwatch out for primary key collisions.
Under the hood, there are three functions thatetl_update chains together:
## function (obj, ...) ## {## obj <- etl_load(etl_transform(etl_extract(obj, ...), ...), ## ...)## invisible(obj)## }## <bytecode: 0x5c2a32066318>## <environment: namespace:etl>etl_create is simply a call toetl_updatethat forces the SQL database to be written from scratch.
## function (obj, ...) ## {## obj <- etl_cleanup(etl_update(etl_init(obj, ...), ...), ...)## invisible(obj)## }## <bytecode: 0x5c2a340dc080>## <environment: namespace:etl>Now that your database is populated, you can work with it as asrc data table just like any otherdplyrsource.
## Warning: Missing values are always removed in SQL aggregation functions.## Use `na.rm = TRUE` to silence this warning## This warning is displayed once every 8 hours.## # Source: SQL [?? x 3]## # Database: sqlite 3.50.1 [/tmp/Rtmp00cuuZ/file276e72f83f738.sqlite3]## cyl N mean_mpg## <int> <int> <dbl>## 1 4 11 26.7## 2 6 7 19.7## 3 8 14 15.1etlSuppose you want to create your own ETL package calledpkgname. All you have to do is write a package thatrequiresetl, and then you have to writeone S3methods:
You may also wish to write
All of these functions must take and return an object of classetl_pkgname that inherits frometl. Please seethe “Extending etl” vignette for moreinformation.