| Type: | Package |
| Title: | Extract-Transform-Load Framework for Medium Data |
| Version: | 0.4.2 |
| Maintainer: | Benjamin S. Baumer <ben.baumer@gmail.com> |
| Description: | A predictable and pipeable framework for performing ETL (extract-transform-load) operations on publicly-accessible medium-sized data set. This package sets up the method structure and implements generic functions. Packages that depend on this package download specific data sets from the Internet, clean them up, and import them into a local or remote relational database management system. |
| License: | CC0 |
| Imports: | DBI, dbplyr, datasets, downloader, fs, janitor, lubridate,methods, readr, rlang, rvest, tibble, usethis, utils, xml2 |
| Depends: | R (≥ 4.2), dplyr |
| Suggests: | knitr, RSQLite, RPostgres, RMariaDB, ggplot2, testthat,rmarkdown |
| URL: | https://github.com/beanumber/etl |
| BugReports: | https://github.com/beanumber/etl/issues |
| RoxygenNote: | 7.3.2 |
| Encoding: | UTF-8 |
| VignetteBuilder: | knitr, rmarkdown, ggplot2, dplyr, dbplyr |
| NeedsCompilation: | no |
| Packaged: | 2025-07-24 18:45:06 UTC; bbaumer |
| Author: | Benjamin S. Baumer |
| Repository: | CRAN |
| Date/Publication: | 2025-07-24 19:10:02 UTC |
Create an ETL package skeleton
Description
Create an ETL package skeleton
Usage
create_etl_package(...)Arguments
... | arguments passed to |
Details
Extendsusethis::create_package() and places a template source file inthe R subdirectory of the new package. The file has a working stub ofetl_extract().The new package can be built immediately and run.
New S3 methods foretl_transform() andetl_load() can be added ifnecessary, but the default methods may suffice.
See Also
etl_extract(),etl_transform(),etl_load()
Examples
## Not run: path <- file.path(tempdir(), "scorecard")create_etl_package(path)## End(Not run)# Now switch projects, and "Install and Restart"Execute an SQL script
Description
Execute an SQL script
Usage
dbRunScript(conn, script, echo = FALSE, ...)Arguments
conn | aDBI::DBIConnection object |
script | Either a filename pointing to an SQL script ora character vector of length 1 containing SQL. |
echo | print the SQL commands to the output? |
... | arguments passed to |
Details
The SQL script file must be; delimited.
Value
a list of results fromDBI::dbExecute() for each of the individualSQL statements inscript.
Examples
sql <- "SHOW TABLES; SELECT 1+1 as Two;"sql2 <- system.file("sql", "mtcars.mysql", package = "etl")sql3 <- "SELECT * FROM user WHERE user = 'mysql';SELECT * FROM user WHERE 't' = 't';"if (require(RSQLite)) { con <- dbConnect(RSQLite::SQLite()) dbRunScript(con, "SELECT 1+1 as Two; VACUUM; ANALYZE;")}## Not run: if (require(RMariaDB)) { con <- dbConnect(RMariaDB::MySQL(), default.file = path.expand("~/.my.cnf"), group = "client", user = NULL, password = NULL, dbname = "mysql", host = "127.0.0.1") dbRunScript(con, script = sql) dbRunScript(con, script = sql2) dbRunScript(con, script = sql3) dbDisconnect(con)}## End(Not run)Wipe out all tables in a database
Description
Wipe out all tables in a database
Usage
dbWipe(conn, ...)Arguments
conn | aDBI::DBIConnection object |
... | arguments passed to |
Details
Finds all tables within a database and removes them
Return the database type for an ETL or DBI connection
Description
Return the database type for an ETL or DBI connection
Usage
db_type(obj, ...)## S3 method for class 'src_dbi'db_type(obj, ...)## S3 method for class 'DBIConnection'db_type(obj, ...)Arguments
obj | andetl orDBI::DBIConnection object |
... | currently ignored |
Examples
if (require(RMariaDB) && mariadbHasDefault()) { # connect to test database using rs-dbi db <- src_mysql_cnf() class(db) db # connect to another server using the 'client' group db_type(db) db_type(db$con)}Initialize anetl object
Description
Initialize anetl object
Usage
etl(x, db = NULL, dir = tempdir(), ...)## Default S3 method:etl(x, db = NULL, dir = tempdir(), ...)## S3 method for class 'etl'summary(object, ...)is.etl(object)## S3 method for class 'etl'print(x, ...)Arguments
x | the name of the |
db | a database connection that inherits from |
dir | a directory to store the raw and processed data files |
... | arguments passed to methods (currently ignored) |
object | an object for which a summary is desired. |
Details
A constructor function that instantiates anetl object.Anetl object extends adplyr::src_dbi() object.It also has attributes for:
- pkg
the name of the
etlpackage corresponding to the data source- dir
the directory where the raw and processed data are stored
- raw_dir
the directory where the raw data files are stored
- load_dir
the directory where the processed data files are stored
Just like anydplyr::src_dbi() object, anetl objectis a data source backed by an SQL database. However, anetl objecthas additional functionality based on the presumption that the SQL databasewill be populated from data files stored on the local hard disk. The ETL functionsdocumented inetl_create() provide the necessary functionalityforextracting data from the Internet toraw_dir,transforming those dataand placing the cleaned up data (usually in CSV format) intoload_dir,and finallyloading the clean data into the SQL database.
Value
Foretl, an object of classetl_x andetl that inheritsfromdplyr::src_dbi()
Foris.etl(),TRUE orFALSE,depending on whetherx has classetl
See Also
etl_create()
Examples
# Instantiate the etl objectcars <- etl("mtcars")str(cars)is.etl(cars)summary(cars)## Not run: # connect to a PostgreSQL serverif (require(RPostgreSQL)) { db <- src_postgres("mtcars", user = "postgres", host = "localhost") cars <- etl("mtcars", db)}## End(Not run)# Do it step-by-stepcars |> etl_extract() |> etl_transform() |> etl_load()src_tbls(cars)cars |> tbl("mtcars") |> group_by(cyl) |> summarize(N = n(), mean_mpg = mean(mpg))# Do it all in one stepcars2 <- etl("mtcars")cars2 |> etl_update()src_tbls(cars2)# generic summary function provides information about the objectcars <- etl("mtcars")summary(cars)cars <- etl("mtcars")# returns TRUEis.etl(cars)# returns FALSEis.etl("hello world")cars <- etl("mtcars") |> etl_create()carsETL functions for working with medium sized data
Description
These generic functions provide a systematic approachfor performing ETL (exchange-transform-load) operations on mediumsized data.
Usage
etl_cleanup(obj, ...)## Default S3 method:etl_cleanup( obj, delete_raw = FALSE, delete_load = FALSE, pattern = "\\.(csv|zip)$", ...)etl_create(obj, ...)## Default S3 method:etl_create(obj, ...)etl_update(obj, ...)## Default S3 method:etl_update(obj, ...)etl_extract(obj, ...)## Default S3 method:etl_extract(obj, ...)## S3 method for class 'etl_mtcars'etl_extract(obj, ...)## S3 method for class 'etl_cities'etl_extract(obj, ...)etl_load(obj, ...)## Default S3 method:etl_load(obj, ...)etl_transform(obj, ...)## Default S3 method:etl_transform(obj, ...)## S3 method for class 'etl_cities'etl_transform(obj, ...)Arguments
obj | anetl object |
... | arguments passed to methods |
delete_raw | should files be deleted from the |
delete_load | should files be deleted from the |
pattern | regular expression matching file names to be deleted. By default,this matches filenames ending in |
Details
The purposes of these functions are to download data from aparticular data source from the Internet, process it, and load itinto a SQL database server.
There are five primary functions:
etl_init()Initialize the database schema.
etl_extract()Download data from the Internet and store it locally inits raw form.
etl_transform()Manipulate the raw data such that it can be loadedinto a database table. Usually, this means converting the raw data to(a series of) CSV files, which are also stored locally.
etl_load()Load the transformed data into the database.
etl_cleanup()Perform housekeeping, such as deleting unnecessaryraw data files.
Additionally, two convenience functions chain these operations together:
etl_create()Run all five functions in succession.This is useful when you wantto create the database from scratch.
etl_update()Run the
etl_extract()-etl_transform()-etl_load()functionsin succession.This is usefulwhen the database already exists, but you want to insert some new data.
Value
Each one of these functions returns anetl object, invisibly.
See Also
Examples
## Not run: if (require(RPostgreSQL)) { db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost") cars <- etl("mtcars", db)}if (require(RMariaDB) && mariadbHasDefault()) { db <- src_mysql(dbname = "mtcars", user = "r-user", host = "localhost", password = "mypass") cars <- etl("mtcars", db)}## End(Not run)cars <- etl("mtcars")cars |> etl_extract() |> etl_transform() |> etl_load() |> etl_cleanup()carscars |> tbl(from = "mtcars") |> group_by(cyl) |> summarise(N = n(), mean_mpg = mean(mpg)) # do it all in one step, and peek at the SQL creation script cars |> etl_create(echo = TRUE) # specify a directory for the data ## Not run: cars <- etl("mtcars", dir = "~/dumps/mtcars/") str(cars) ## End(Not run)cars <- etl("mtcars")# Do it step-by-stepcars |> etl_extract() |> etl_transform() |> etl_load()# Note the somewhat imprecise data types for the columns. These are the default.tbl(cars, "mtcars")# But you can also specify your own schema if you wantschema <- system.file("sql", "init.sqlite", package = "etl")cars |> etl_init(schema) |> etl_load()Initialize a database using a defined schema
Description
Initialize a database using a defined schema
Usage
etl_init( obj, script = NULL, schema_name = "init", pkg = attr(obj, "pkg"), ext = NULL, ...)## Default S3 method:etl_init( obj, script = NULL, schema_name = "init", pkg = attr(obj, "pkg"), ext = NULL, ...)find_schema(obj, schema_name = "init", pkg = attr(obj, "pkg"), ext = NULL, ...)Arguments
obj | Anetl object |
script | either a vector of SQL commands to be executed, ora file path as a character vector containing an SQL initialization script.If |
schema_name | The name of the schema. Default is |
pkg | The package defining the schema. Should be set inetl. |
ext | The file extension used for the SQL schema file. If NULL (the default) itbe inferred from the |
... | Currently ignored |
Details
If the table definitions are at all non-trivial,you may wish to include a pre-defined table schema. This functionwill retrieve it.
Examples
cars <- etl("mtcars")cars |> etl_init()cars |> etl_init(script = sql("CREATE TABLE IF NOT EXISTS mtcars_alt (id INTEGER);"))cars |> etl_init(schema_name = "init")init_script <- find_schema(cars, schema_name = "init")cars |> etl_init(script = init_script, echo = TRUE)src_tbls(cars)cars <- etl("mtcars")find_schema(cars)find_schema(cars, "init", "etl")find_schema(cars, "my_crazy_schema", "etl")Match year and month vectors to filenames
Description
Match year and month vectors to filenames
Extracts a date from filenames
Usage
match_files_by_year_months( files, pattern, years = as.numeric(format(Sys.Date(), "%Y")), months = 1:12, ...)extract_date_from_filename(files, pattern, ...)Arguments
files | a character vector of filenames |
pattern | a regular expression to be passed to |
years | a numeric vector of years |
months | a numeric vector of months |
... | arguments passed to |
Value
a character vector offiles that match thepattern,year, andmonth arguments
a vector ofbase::POSIXct dates matching the pattern
Examples
## Not run: if (require(airlines)) { airlines <- etl("airlines", dir = "~/Data/airlines") |> etl_extract(year = 1987) summary(airlines) match_files_by_year_months(list.files(attr(airlines, "raw_dir")), pattern = "On_Time_On_Time_Performance_%Y_%m.zip", year = 1987)}## End(Not run)Download only those files that don't already exist
Description
Download only those files that don't already exist
Usage
smart_download(obj, src, new_filenames = basename(src), clobber = FALSE, ...)Arguments
obj | anetl object |
src | a character vector of URLs that you want to download |
new_filenames | an optional character vector of filenames for the new(local) files. Defaults to having the same filenames as those in |
clobber | do you want to clobber any existing files? |
... | arguments passed to |
Details
Downloads only those files insrc that are not already present inthe directory specified by theraw_dir attribute ofobj.
Author(s)
idiom courtesy of Hadley Wickham
Examples
## Not run: cars <- etl("mtcars")urls <- c("https://raw.githubusercontent.com/beanumber/etl/master/etl.Rproj","https://www.reddit.com/robots.txt")smart_download(cars, src = urls)# won't download again if the files are already theresmart_download(cars, src = urls)# use clobber to overwritesmart_download(cars, src = urls, clobber = TRUE)## End(Not run)Upload a list of files to the DB
Description
Upload a list of files to the DB
Usage
smart_upload(obj, src = NULL, tablenames = NULL, ...)Arguments
obj | Anetl object |
src | a list of CSV files to upload. If |
tablenames | a list the same length as |
... | arguments passed to |
Examples
## Not run: if (require(RMariaDB)) { # must have pre-existing database "fec" # if not, try system("mysql -e 'CREATE DATABASE IF NOT EXISTS fec;'") db <- src_mysql_cnf(dbname = "mtcars")}## End(Not run)Connect to local MySQL Server using ~/.my.cnf
Description
Connect to local MySQL Server using ~/.my.cnf
Usage
src_mysql_cnf(dbname = "test", groups = "rs-dbi", ...)Arguments
dbname | name of the local database you wish to connect to. Default is |
groups | section of |
... | arguments passed to |
See Also
dplyr::src_mysql(),RMariaDB::mariadbHasDefault()
Examples
if (require(RMariaDB) && mariadbHasDefault()) { # connect to test database using rs-dbi db <- src_mysql_cnf() class(db) db # connect to another server using the 'client' group src_mysql_cnf(groups = "client")}Ensure that years and months are within a certain time span
Description
Ensure that years and months are within a certain time span
Usage
valid_year_month(years, months, begin = "1870-01-01", end = Sys.Date())Arguments
years | a numeric vector of years |
months | a numeric vector of months |
begin | the earliest valid date, defaults to the UNIX epoch |
end | the most recent valid date, defaults to today |
Details
Often, a data source willbegin andend atknown points in time. At the same time, many data sources are dividedinto monthly archives. Given a set ofyears andmonths,any combination of which should be considered valid, this function willreturn adata.frame in which each row is one of thosevalid year-month pairs. Further, if the optionalbegin andend arguments are specified, the rows will be filter to liewithin that time interval. Furthermore, the first and last day ofeach month are computed.
Value
adata.frame with four variables:year,month,month_begin (the first day of the month), andmonth_end (the last day of the month).
Examples
valid_year_month(years = 1999:2001, months = c(1:3, 7))# Mets in the World Series since the UNIX epochmets_ws <- c(1969, 1973, 1986, 2000, 2015)valid_year_month(years = mets_ws, months = 10)# Mets in the World Series during the Clinton administrationif (require(ggplot2)) { clinton <- filter(presidential, name == "Clinton") valid_year_month(years = mets_ws, months = 10, begin = clinton$start, end = clinton$end)}