Movatterモバイル変換

Type:

Package

Title:

Extract-Transform-Load Framework for Medium Data

Version:

0.4.2

Maintainer:

Benjamin S. Baumer <ben.baumer@gmail.com>

Description:

A predictable and pipeable framework for performing ETL (extract-transform-load) operations on publicly-accessible medium-sized data set. This package sets up the method structure and implements generic functions. Packages that depend on this package download specific data sets from the Internet, clean them up, and import them into a local or remote relational database management system.

License:

CC0

Imports:

DBI, dbplyr, datasets, downloader, fs, janitor, lubridate,methods, readr, rlang, rvest, tibble, usethis, utils, xml2

Depends:

R (≥ 4.2), dplyr

Suggests:

knitr, RSQLite, RPostgres, RMariaDB, ggplot2, testthat,rmarkdown

URL:

https://github.com/beanumber/etl

BugReports:

https://github.com/beanumber/etl/issues

RoxygenNote:

7.3.2

Encoding:

UTF-8

VignetteBuilder:

knitr, rmarkdown, ggplot2, dplyr, dbplyr

NeedsCompilation:

Packaged:

2025-07-24 18:45:06 UTC; bbaumer

Author:

Benjamin S. Baumer

[aut, cre], Carson Sievert [ctb], Natalia Iannucci [ctb]

Repository:

CRAN

Date/Publication:

2025-07-24 19:10:02 UTC

Create an ETL package skeleton

Description

Create an ETL package skeleton

Usage

create_etl_package(...)

Arguments

...

arguments passed tousethis::create_package()

Details

Extendsusethis::create_package() and places a template source file inthe R subdirectory of the new package. The file has a working stub ofetl_extract().The new package can be built immediately and run.

New S3 methods foretl_transform() andetl_load() can be added ifnecessary, but the default methods may suffice.

Examples

## Not run: path <- file.path(tempdir(), "scorecard")create_etl_package(path)## End(Not run)# Now switch projects, and "Install and Restart"

Execute an SQL script

Description

Execute an SQL script

Usage

dbRunScript(conn, script, echo = FALSE, ...)

Arguments

conn

aDBI::DBIConnection object

script

Either a filename pointing to an SQL script ora character vector of length 1 containing SQL.

echo

print the SQL commands to the output?

...

arguments passed toDBI::dbExecute()

Details

The SQL script file must be⁠;⁠ delimited.

Value

a list of results fromDBI::dbExecute() for each of the individualSQL statements inscript.

Examples

sql <- "SHOW TABLES; SELECT 1+1 as Two;"sql2 <- system.file("sql", "mtcars.mysql", package = "etl")sql3 <- "SELECT * FROM user WHERE user = 'mysql';SELECT * FROM user WHERE 't' = 't';"if (require(RSQLite)) {  con <- dbConnect(RSQLite::SQLite())  dbRunScript(con, "SELECT 1+1 as Two; VACUUM; ANALYZE;")}## Not run: if (require(RMariaDB)) { con <- dbConnect(RMariaDB::MySQL(), default.file = path.expand("~/.my.cnf"),   group = "client", user = NULL, password = NULL, dbname = "mysql", host = "127.0.0.1") dbRunScript(con, script = sql) dbRunScript(con, script = sql2) dbRunScript(con, script = sql3) dbDisconnect(con)}## End(Not run)

Wipe out all tables in a database

Description

Wipe out all tables in a database

Usage

dbWipe(conn, ...)

Arguments

conn

aDBI::DBIConnection object

...

arguments passed toDBI::dbRemoveTable()

Details

Finds all tables within a database and removes them

Return the database type for an ETL or DBI connection

Description

Return the database type for an ETL or DBI connection

Usage

db_type(obj, ...)## S3 method for class 'src_dbi'db_type(obj, ...)## S3 method for class 'DBIConnection'db_type(obj, ...)

Arguments

obj

andetl orDBI::DBIConnection object

...

currently ignored

Examples

if (require(RMariaDB) && mariadbHasDefault()) {  # connect to test database using rs-dbi  db <- src_mysql_cnf()  class(db)  db  # connect to another server using the 'client' group  db_type(db)  db_type(db$con)}

Initialize an`etl` object

Description

Initialize anetl object

Usage

etl(x, db = NULL, dir = tempdir(), ...)## Default S3 method:etl(x, db = NULL, dir = tempdir(), ...)## S3 method for class 'etl'summary(object, ...)is.etl(object)## S3 method for class 'etl'print(x, ...)

Arguments

x

the name of theetl package that you wish to populate with data.This determines the class of the resultingetl object, whichdetermines method dispatch of⁠etl_*()⁠ functions. There is no default,but you can usedatasets::mtcars as an test example.

db

a database connection that inherits fromdplyr::src_dbi().It is NULL by default, which results in aRSQLite::SQLite() connectionbeing created indir.

dir

a directory to store the raw and processed data files

...

arguments passed to methods (currently ignored)

object

an object for which a summary is desired.

Details

A constructor function that instantiates anetl object.Anetl object extends adplyr::src_dbi() object.It also has attributes for:

pkg: the name of theetl package corresponding to the data source
dir: the directory where the raw and processed data are stored
raw_dir: the directory where the raw data files are stored
load_dir: the directory where the processed data files are stored

Just like anydplyr::src_dbi() object, anetl objectis a data source backed by an SQL database. However, anetl objecthas additional functionality based on the presumption that the SQL databasewill be populated from data files stored on the local hard disk. The ETL functionsdocumented inetl_create() provide the necessary functionalityforextracting data from the Internet toraw_dir,transforming those dataand placing the cleaned up data (usually in CSV format) intoload_dir,and finallyloading the clean data into the SQL database.

Value

Foretl, an object of classetl_x andetl that inheritsfromdplyr::src_dbi()

Foris.etl(),TRUE orFALSE,depending on whetherx has classetl

Examples

# Instantiate the etl objectcars <- etl("mtcars")str(cars)is.etl(cars)summary(cars)## Not run: # connect to a PostgreSQL serverif (require(RPostgreSQL)) {  db <- src_postgres("mtcars", user = "postgres", host = "localhost")  cars <- etl("mtcars", db)}## End(Not run)# Do it step-by-stepcars |>  etl_extract() |>  etl_transform() |>  etl_load()src_tbls(cars)cars |>  tbl("mtcars") |>  group_by(cyl) |>  summarize(N = n(), mean_mpg = mean(mpg))# Do it all in one stepcars2 <- etl("mtcars")cars2 |>  etl_update()src_tbls(cars2)# generic summary function provides information about the objectcars <- etl("mtcars")summary(cars)cars <- etl("mtcars")# returns TRUEis.etl(cars)# returns FALSEis.etl("hello world")cars <- etl("mtcars") |>  etl_create()cars

ETL functions for working with medium sized data

Description

These generic functions provide a systematic approachfor performing ETL (exchange-transform-load) operations on mediumsized data.

Usage

etl_cleanup(obj, ...)## Default S3 method:etl_cleanup(  obj,  delete_raw = FALSE,  delete_load = FALSE,  pattern = "\\.(csv|zip)$",  ...)etl_create(obj, ...)## Default S3 method:etl_create(obj, ...)etl_update(obj, ...)## Default S3 method:etl_update(obj, ...)etl_extract(obj, ...)## Default S3 method:etl_extract(obj, ...)## S3 method for class 'etl_mtcars'etl_extract(obj, ...)## S3 method for class 'etl_cities'etl_extract(obj, ...)etl_load(obj, ...)## Default S3 method:etl_load(obj, ...)etl_transform(obj, ...)## Default S3 method:etl_transform(obj, ...)## S3 method for class 'etl_cities'etl_transform(obj, ...)

Arguments

obj

anetl object

...

arguments passed to methods

delete_raw

should files be deleted from theraw_dir?

delete_load

should files be deleted from theload_dir?

pattern

regular expression matching file names to be deleted. By default,this matches filenames ending in.csv and.zip.

Details

The purposes of these functions are to download data from aparticular data source from the Internet, process it, and load itinto a SQL database server.

There are five primary functions:

etl_init(): Initialize the database schema.
etl_extract(): Download data from the Internet and store it locally inits raw form.
etl_transform(): Manipulate the raw data such that it can be loadedinto a database table. Usually, this means converting the raw data to(a series of) CSV files, which are also stored locally.
etl_load(): Load the transformed data into the database.
etl_cleanup(): Perform housekeeping, such as deleting unnecessaryraw data files.

Additionally, two convenience functions chain these operations together:

etl_create(): Run all five functions in succession.This is useful when you wantto create the database from scratch.
etl_update(): Run theetl_extract()-etl_transform()-etl_load() functionsin succession.This is usefulwhen the database already exists, but you want to insert some new data.

Value

Each one of these functions returns anetl object, invisibly.

Examples

## Not run: if (require(RPostgreSQL)) {  db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")  cars <- etl("mtcars", db)}if (require(RMariaDB) && mariadbHasDefault()) {  db <- src_mysql(dbname = "mtcars", user = "r-user",                  host = "localhost", password = "mypass")  cars <- etl("mtcars", db)}## End(Not run)cars <- etl("mtcars")cars |> etl_extract() |> etl_transform() |> etl_load() |> etl_cleanup()carscars |> tbl(from = "mtcars") |> group_by(cyl) |> summarise(N = n(), mean_mpg = mean(mpg)) # do it all in one step, and peek at the SQL creation script cars |>   etl_create(echo = TRUE) # specify a directory for the data ## Not run:  cars <- etl("mtcars", dir = "~/dumps/mtcars/") str(cars) ## End(Not run)cars <- etl("mtcars")# Do it step-by-stepcars |>  etl_extract() |>  etl_transform() |>  etl_load()# Note the somewhat imprecise data types for the columns. These are the default.tbl(cars, "mtcars")# But you can also specify your own schema if you wantschema <- system.file("sql", "init.sqlite", package = "etl")cars |>  etl_init(schema) |>  etl_load()

Initialize a database using a defined schema

Description

Initialize a database using a defined schema

Usage

etl_init(  obj,  script = NULL,  schema_name = "init",  pkg = attr(obj, "pkg"),  ext = NULL,  ...)## Default S3 method:etl_init(  obj,  script = NULL,  schema_name = "init",  pkg = attr(obj, "pkg"),  ext = NULL,  ...)find_schema(obj, schema_name = "init", pkg = attr(obj, "pkg"), ext = NULL, ...)

Arguments

obj

Anetl object

script

either a vector of SQL commands to be executed, ora file path as a character vector containing an SQL initialization script.IfNULL (the default), then the appropriate built-inschema will be fetched by [find_schema(), if it exists. Notethat the flavor of SQL in this file must match the type of the source. That is,if your object is of type [dplyr::src_mysql(), then make sure thatthe schema you specify here is written in MySQL (and not PostgreSQL). Pleasenote that SQL syntax is not, in general, completely portable. Use with caution, as this mayclobber any existing data you have in an existing database.

schema_name

The name of the schema. Default isinit.

pkg

The package defining the schema. Should be set inetl.

ext

The file extension used for the SQL schema file. If NULL (the default) itbe inferred from the⁠src_*⁠ class ofcon. For example, ifconhas classRSQLite::SQLite() thenext will besqlite.

...

Currently ignored

Details

If the table definitions are at all non-trivial,you may wish to include a pre-defined table schema. This functionwill retrieve it.

Examples

cars <- etl("mtcars")cars |>  etl_init()cars |>  etl_init(script = sql("CREATE TABLE IF NOT EXISTS mtcars_alt (id INTEGER);"))cars |>  etl_init(schema_name = "init")init_script <- find_schema(cars, schema_name = "init")cars |>  etl_init(script = init_script, echo = TRUE)src_tbls(cars)cars <- etl("mtcars")find_schema(cars)find_schema(cars, "init", "etl")find_schema(cars, "my_crazy_schema", "etl")

Match year and month vectors to filenames

Description

Match year and month vectors to filenames

Extracts a date from filenames

Usage

match_files_by_year_months(  files,  pattern,  years = as.numeric(format(Sys.Date(), "%Y")),  months = 1:12,  ...)extract_date_from_filename(files, pattern, ...)

Arguments

files

a character vector of filenames

pattern

a regular expression to be passed tolubridate::fast_strptime()

years

a numeric vector of years

months

a numeric vector of months

...

arguments passed tolubridate::fast_strptime()

Value

a character vector offiles that match thepattern,year, andmonth arguments

a vector ofbase::POSIXct dates matching the pattern

Examples

## Not run: if (require(airlines)) {  airlines <- etl("airlines", dir = "~/Data/airlines") |>    etl_extract(year = 1987)  summary(airlines)  match_files_by_year_months(list.files(attr(airlines, "raw_dir")),    pattern = "On_Time_On_Time_Performance_%Y_%m.zip", year = 1987)}## End(Not run)

Download only those files that don't already exist

Description

Download only those files that don't already exist

Usage

smart_download(obj, src, new_filenames = basename(src), clobber = FALSE, ...)

Arguments

obj

anetl object

src

a character vector of URLs that you want to download

new_filenames

an optional character vector of filenames for the new(local) files. Defaults to having the same filenames as those insrc.

clobber

do you want to clobber any existing files?

...

arguments passed todownloader::download()

Details

Downloads only those files insrc that are not already present inthe directory specified by theraw_dir attribute ofobj.

Author(s)

idiom courtesy of Hadley Wickham

Examples

## Not run: cars <- etl("mtcars")urls <- c("https://raw.githubusercontent.com/beanumber/etl/master/etl.Rproj","https://www.reddit.com/robots.txt")smart_download(cars, src = urls)# won't download again if the files are already theresmart_download(cars, src = urls)# use clobber to overwritesmart_download(cars, src = urls, clobber = TRUE)## End(Not run)

Upload a list of files to the DB

Description

Upload a list of files to the DB

Usage

smart_upload(obj, src = NULL, tablenames = NULL, ...)

Arguments

obj

Anetl object

src

a list of CSV files to upload. IfNULL, will return allCSVs in the load directory

tablenames

a list the same length assrc of tablenames in thedatabase corresponding to each of the files insrc. IfNULL,will default to the same name assrc, without paths or file extensions.

...

arguments passed toDBI::dbWriteTable()

Examples

## Not run: if (require(RMariaDB)) {  # must have pre-existing database "fec"  # if not, try  system("mysql -e 'CREATE DATABASE IF NOT EXISTS fec;'")  db <- src_mysql_cnf(dbname = "mtcars")}## End(Not run)

Connect to local MySQL Server using ~/.my.cnf

Description

Connect to local MySQL Server using ~/.my.cnf

Usage

src_mysql_cnf(dbname = "test", groups = "rs-dbi", ...)

Arguments

dbname

name of the local database you wish to connect to. Default istest, as inRMariaDB::mariadbHasDefault().

groups

section of⁠~/.my.cnf⁠ file. Default isrs-dbi as inRMariaDB::mariadbHasDefault()

...

arguments passed todplyr::src_mysql()

Examples

if (require(RMariaDB) && mariadbHasDefault()) {  # connect to test database using rs-dbi  db <- src_mysql_cnf()  class(db)  db  # connect to another server using the 'client' group  src_mysql_cnf(groups = "client")}

Ensure that years and months are within a certain time span

Description

Ensure that years and months are within a certain time span

Usage

valid_year_month(years, months, begin = "1870-01-01", end = Sys.Date())

Arguments

years

a numeric vector of years

months

a numeric vector of months

begin

the earliest valid date, defaults to the UNIX epoch

end

the most recent valid date, defaults to today

Details

Often, a data source willbegin andend atknown points in time. At the same time, many data sources are dividedinto monthly archives. Given a set ofyears andmonths,any combination of which should be considered valid, this function willreturn adata.frame in which each row is one of thosevalid year-month pairs. Further, if the optionalbegin andend arguments are specified, the rows will be filter to liewithin that time interval. Furthermore, the first and last day ofeach month are computed.

Value

adata.frame with four variables:year,month,month_begin (the first day of the month), andmonth_end (the last day of the month).

Examples

valid_year_month(years = 1999:2001, months = c(1:3, 7))# Mets in the World Series since the UNIX epochmets_ws <- c(1969, 1973, 1986, 2000, 2015)valid_year_month(years = mets_ws, months = 10)# Mets in the World Series during the Clinton administrationif (require(ggplot2)) {  clinton <- filter(presidential, name == "Clinton")  valid_year_month(years = mets_ws, months = 10,    begin = clinton$start, end = clinton$end)}

Movatterモバイル変換

Create an ETL package skeleton

Description

Usage

Arguments

Details

See Also

Examples

Execute an SQL script

Description

Usage

Arguments

Details

Value

Examples

Wipe out all tables in a database

Description

Usage

Arguments

Details

Return the database type for an ETL or DBI connection

Description

Usage

Arguments

Examples

Initialize anetl object

Description

Usage

Arguments

Details

Value

See Also

Examples

ETL functions for working with medium sized data

Description

Usage

Arguments

Details

Value

See Also

Examples

Initialize a database using a defined schema

Description

Usage

Arguments

Details

Examples

Match year and month vectors to filenames

Description

Usage

Arguments

Value

Examples

Download only those files that don't already exist

Description

Usage

Arguments

Details

Author(s)

Examples

Upload a list of files to the DB

Description

Usage

Arguments

Examples

Connect to local MySQL Server using ~/.my.cnf

Description

Usage

Arguments

See Also

Examples

Ensure that years and months are within a certain time span

Description

Usage

Arguments

Details

Value

Examples

Initialize an`etl` object