Movatterモバイル変換

Type:

Package

Title:

Quickly Get Datetime Data Ready for Analysis

Version:

0.6.3

Description:

Transforms datetime data into a format ready for analysis. It offers two core functionalities; aggregating data to a higher level interval (thicken) and imputing records where observations were absent (pad).

License:

MIT + file LICENSE

Encoding:

UTF-8

LazyData:

true

Depends:

R (≥ 3.0.0)

Imports:

Rcpp, dplyr (≥ 1.0.0), lubridate, rlang

Suggests:

ggplot2, testthat, knitr, rmarkdown, lazyeval, tidyr,data.table

RoxygenNote:

7.2.1

LinkingTo:

Rcpp

VignetteBuilder:

knitr

URL:

https://edwinth.github.io/padr/,https://github.com/EdwinTh/padr

BugReports:

https://github.com/EdwinTh/padr/issues

ByteCompile:

true

NeedsCompilation:

yes

Packaged:

2024-11-21 15:48:04 UTC; edwinthoen

Author:

Edwin Thoen [aut, cre]

Maintainer:

Edwin Thoen <edwinthoen@gmail.com>

Repository:

CRAN

Date/Publication:

2024-11-21 18:10:02 UTC

Shift to the middle of each interval

Description

After thickening all the values are eithershifted to the first or the last value of their interval.This function creates a vector fromx, with the values shifted tothe (approximate) center of the interval. This can give a more accuratepicture of the aggregated data when plotting.

Usage

center_interval(x, shift = c("up", "down"), interval = NULL)

Arguments

x

A vector of classDate,POSIXct orPOSIXlt.

shift

"up" or "down".

interval

The interval to be used for centering. IfNULL,get_interval will be applied onx.

Details

The interval will be translated to number of days whenx is of classDate, or number of seconds whenx is ofclassPOSIXt. For months and quarters this will be the averagelength of the interval. The translated units divided by twowill be added by or subtracted from each value ofx.

Value

Vector of the same class asx, with the values shifted to the(approximate) center.

Examples

library(dplyr)library(ggplot2)plot_set <- emergency %>%  thicken("hour", "h") %>%  count(h) %>%  head(24)ggplot(plot_set, aes(h, n)) + geom_col()plot_set %>%  mutate(h_center = center_interval(h)) %>%  ggplot(aes(h_center, n)) + geom_col()

Retrieve the closest given weekday

Description

Find the closest instance of the requested weekday tomin(x).Helper function forthicken with the interval "week", when the userdesires the start day of the weeks to be different from Sundays.

Usage

closest_weekday(x, wday = 1, direction = c("down", "up"))

Arguments

x

A vector of classDate,POSIXct, orPOSIXlt.

wday

Integer in the range 0-6 specifying the desired weekday start(0 = Sun, 1 = Mon, 2 = Tue, 3 = Wed, 4 = Thu, 5 = Fri, 6 = Sat).

direction

The first desired weekday before ("down") or after ("up")the first day inx.

Value

Object of classDate, reflecting the closest desired weekdaytox.

Examples

closest_weekday(coffee$time_stamp)closest_weekday(coffee$time_stamp, 5)closest_weekday(coffee$time_stamp, 1, direction = "up")closest_weekday(coffee$time_stamp, 5, direction = "up")

Coffee Data Set

Description

Made-up data set for demonstratingpadr.

Usage

coffee

Format

A data frame with 4 rows and 2 variables:

time_stamp: YYYY-MM-DD HH:MM:SS
amount: Amount spent on coffee

Emergency Calls for Montgomery County, PA

Description

The emergency calls coming in at Montgomery County, PA since 2015-12-10.Data set was created at 2016-10-17 16:15:40 CEST from the API and containsevents until 2016-10-17 09:47:03 EST. From the original set the columnsdesc and e are not included.

Usage

emergency

Format

A data frame with 120450 rows and 6 variables:

lat: Latitude from Google maps, based on the address
lng: Longitude from Google maps, based on the address
zip: Zipcode from Google, when possible
title: Title, emergency category
time_stamp: YYYY-MM-DD HH:MM:SS
twp: Township

Fill missing values by a function of the nonmissings

Description

For each specified column inx replace the missing values by afunction of the nonmissing values.

Usage

fill_by_function(x, ..., fun = mean)

Arguments

x

A data frame.

...

The unquoted column names of the variables that should be filled.

fun

The function to apply on the nonmissing values.

Value

x with the altered columns.

Examples

library(dplyr) # for the pipe operatorx <- seq(as.Date('2016-01-01'), by = 'day', length.out = 366)x <- x[sample(1:366, 200)] %>% sortx_df <- data_frame(x  = x,                   y1 = runif(200, 10, 20) %>% round,                   y2 = runif(200, 1, 50) %>% round)x_df %>% pad %>% fill_by_function(y1, y2)x_df %>% pad %>% fill_by_function(y1, y2, fun = median)

Fill missing values by the most prevalent nonmissing value

Description

For each specified column inx replace the missing values by the mostprevalent nonmissing value.

Usage

fill_by_prevalent(x, ...)

Arguments

x

A data frame.

...

The unquoted column names of the variables that should be filled.

Value

x with the altered columns.

Examples

library(dplyr) # for the pipe operatorx <- seq(as.Date('2016-01-01'), by = 'day', length.out = 366)x <- x[sample(1:366, 200)] %>% sortx_df <- data_frame(x  = x,                  y1 = rep(letters[1:3], c(80, 70, 50)) %>% sample,                  y2 = rep(letters[2:5], c(60, 80, 40, 20)) %>% sample)x_df %>% pad %>% fill_by_prevalent(y1, y2)

Fill missing values by a single value

Description

Replace all missing values in the specified columns by the same value.

Usage

fill_by_value(x, ..., value = 0)

Arguments

x

A data frame.

...

The unquoted column names of the variables that should be filled.

value

The value to replace the missing values by.

Value

x with the altered columns.

Examples

library(dplyr) # for the pipe operatorx <- seq(as.Date('2016-01-01'), by = 'day', length.out = 366)x <- x[sample(1:366, 200)] %>% sortx_df <- data_frame(x  = x,                   y1 = runif(200, 10, 20) %>% round,                   y2 = runif(200, 1, 50) %>% round,                   y3 = runif(200, 20, 40) %>% round,                   y4 = sample(letters[1:5], 200, replace = TRUE))x_padded <- x_df %>% padx_padded %>% fill_by_value(y1)x_df %>% pad %>% fill_by_value(y1, y2, value = 42)

Make a period character vector

Description

After applyingthicken all the observations of a period are mappedto a single time point. This function will convert a datetime variable toa character vector that reflects the period, instead of a single time point.strftime is used to format the start and the end of the interval.

Usage

format_interval(  x,  start_format = "%Y-%m-%d",  end_format = start_format,  sep = " ",  end_offset = 0,  units_to_last = NULL)

Arguments

x

A vector of classDate,POSIXct orPOSIXlt,of which the values are unique.

start_format

String to format the start values of each period, to be usedinstrftime.

end_format

String to format the end values of each period, to be usedinstrftime.

sep

Character string that separates thestart_format and theend_format.

end_offset

Units in days ifx isDate, or in seconds ifx isPOSIXct orPOSIXlt.Will be subtracted from the end of each period.If 0, the end of the previous period is equal to the start of the next.

units_to_last

To determine the formatting of the last value inx,the length of the last period has to be specified. IfNULL thefunction guesses based on the interval ofx.Specify in days whenx isDate, or in seconds whenx isPOSIXct orPOSIXlt.

Details

The end of the periods will be determined by the next unique valueinx. It does so without regarding the interval ofx. If a specificinterval is desired,thicken and / orpad should first beapplied to create an equally spaced datetime variable.

Value

A character vector showing the interval.

Examples

library(dplyr)library(ggplot2)plot_set <- emergency %>%  head(500) %>%  thicken("hour", "h") %>%  count(h)# this will show the data on the full hourggplot(plot_set, aes(h, n)) + geom_col()# adding a character to indicate the hours of the interval.plot_set %>%  mutate(h_int = format_interval(h, "%H", sep = "-"))

Get the interval of a datetime variable

Description

The interval is the highest datetime unit that can explain all instances of avariable of classDate, classPOSIXct, or classPOSIXct.This function will determine what the interval of the variable is.

Usage

get_interval(x)

Arguments

x

A variable of class of classDate or of classPOSIXt.

Details

Seevignette("padr") for more information on intervals.

Value

A character string indicating the interval ofx.

Examples

x_month <- seq(as.Date('2016-01-01'), as.Date('2016-05-01'), by = 'month')get_interval(x_month)x_sec <- seq(as.POSIXct('2016-01-01 00:00:00'), length.out = 100, by = 'sec')get_interval(x_sec)get_interval(x_sec[seq(0, length(x_sec), by = 5)])

Pad the datetime column of a data frame

Description

pad will fill the gaps in incomplete datetime variables, by figuring outwhat the interval of the data is and what instances are missing. It will inserta record for each of the missing time points. For allother variables in the data frame a missing value will be inserted at the padded rows.

Usage

pad(  x,  interval = NULL,  start_val = NULL,  end_val = NULL,  by = NULL,  group = NULL,  break_above = 1)

Arguments

x

A data frame containing at least one variable of classDate,POSIXct orPOSIXlt.

interval

The interval of the returned datetime variable.Any character string that would be accepted byseq.Date() orseq.POSIXt. The only exceptions is "DSTday", which is not accepted.pad will take care of daylight savings time when regular "day" is used.When NULL the the interval will be equal to the interval of the datetime variable.When specified it can only be lower than the interval and step size of the input data.See Details.

start_val

An object of classDate,POSIXct orPOSIXlt that specifies the start of the returned datetime variable.If NULL it will use the lowest value of the input variable.

end_val

An object of classDate,POSIXct orPOSIXlt that specifies the end of returned datetime variable.If NULL it will use the highest value of the input variable.

by

Only needs to be specified whenx contains multiplevariables of classDate,POSIXct orPOSIXlt.Indicates which variable to use for padding.

group

Optional character vector that specifies the groupingvariable(s). Padding will take place within the different groups. Wheninterval is not specified, it will be determined applyingget_intervalon the datetime variable as a whole, ignoring groups (see last example).

break_above

Numeric value that indicates the number of rows in millionsabove which the function will break. Safety net for situations where theinterval is different than expected and padding yields a very largedataframe, possibly overflowing memory.

Details

The interval of a datetime variable is the time unit at which theobservations occur. The eight intervals inpadr are from high to lowyear,quarter,month,week,day,hour,min, andsec. Sincepadr v.0.3.0 theinterval is no longer limited to be of a single unit.(Intervals like 5 minutes, 6 hours, 10 days are possible).pad willfigure out the interval of the input variable and the step size, and willfill the gaps for the instances that would be expected from the interval andstep size, but are missing in the input data.Note that whenstart_val and/orend_val are specified, they areconcatenated with the datetime variable before the interval is determined.

Rows with missing values in the datetime variables will be retained.However, they will be moved to the end of the returned data frame.

Value

The data framex with the datetime variable padded. Allnon-grouping variables in the data frame will have missing values at the rowsthat are padded. The result will always be sorted on the datetime variable.Ifgroup is notNULL result is sorted on grouping variable(s)first, then on the datetime variable.

Examples

simple_df <- data.frame(day = as.Date(c('2016-04-01', '2016-04-03')),                        some_value = c(3,4))pad(simple_df)pad(simple_df, interval = "day")library(dplyr) # for the pipe operatormonth <- seq(as.Date('2016-04-01'), as.Date('2017-04-01'),              by = 'month')[c(1, 4, 5, 7, 9, 10, 13)]month_df <- data.frame(month = month,                       y = runif(length(month), 10, 20) %>% round)# forward fill the padded values with tidyr's fillmonth_df %>% pad %>% tidyr::fill(y)# or fill all y with 0month_df %>% pad %>% fill_by_value(y)# padding a data.frame on group levelday_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each =4),                       grp2 = letters[1:2],                       y    = runif(12, 10, 20) %>% round(0),                       date = sample(day_var, 12, TRUE)) %>% arrange(grp1, grp2, date)# pad by one grouping varx_df_grp %>% pad(group = 'grp1')# pad by two groups varsx_df_grp %>% pad(group = c('grp1', 'grp2'), interval = "month")# Using group argument the interval is determined over all the observations,# ignoring the groups.x <- data.frame(dt_var = as.Date(c("2017-01-01", "2017-03-01", "2017-05-01","2017-01-01", "2017-02-01", "2017-04-01")),id = rep(1:2, each = 3), val = round(rnorm(6)))pad(x, group = "id")# applying pad with do, interval is determined individualle for each groupx %>% group_by(id) %>% do(pad(.))

Pad with a custom spanning

Description

Pad the datetime variable afterthicken_cust is applied, using the samespanning.

Usage

pad_cust(x, spanned, by = NULL, group = NULL, drop_last_spanned = TRUE)

Arguments

x

A data frame containing at least one datetime variable ofclassDate,POSIXct orPOSIXlt.

spanned

A datetime vector to which the the datetime variable inx should be mapped. Seesubset_span for quickly spanningunequally spaced variables.

by

Only needs to be specified whenx contains multiplevariables of classDate,POSIXct orPOSIXlt.

group

Optional character vector that specifies the groupingvariable(s). Padding will take place within the different group values.

drop_last_spanned

Logical, indicating whether to drop the last valuefromspanned. The spanned is typically around the datetime variable.This would create an empty last record when padding. Setting toTRUEwill drop the last value inspanned and will not create an emptylast record in this situation.

Value

The data framex with the datetime column padded.

Examples

library(dplyr)# analysis of traffic accidents in traffic jam hours and other hours.accidents <- emergency %>% filter(title == "Traffic: VEHICLE ACCIDENT -")spanning <- span_time("20151210 16", "20161017 17", tz = "EST") %>%  subset_span(list(hour = c(6, 9, 16, 19)))thicken_cust(accidents, spanning, "period") %>%  count(period) %>%  pad_cust(spanning)

Pad the integer column of a data frame

Description

pad_int fills the gaps in incomplete integer variables. It will inserta record for each of the missing value. For allother variables in the data frame a missing value will be inserted at thepadded rows.

Usage

pad_int(x, by, start_val = NULL, end_val = NULL, group = NULL, step = 1)

Arguments

x

A data frame.

by

The column to be padded.

start_val

The first value of the returned variable. If NULL it willuse the lowest value of the input variable.

end_val

The last value of the returned variable. If NULL it will usethe highest value of the input variable.

group

Optional character vector that specifies the groupingvariable(s). Padding will take place within the different group values.

step

The step size of the returned variable.

Value

The data framex with the specified variable padded. Allnon-grouping variables in the data frame will have missing values at the rowsthat are padded.

Examples

int_df <- data.frame(x = c(2005, 2007, 2008, 2011),                     val = c(3, 2, 6, 3))pad_int(int_df, 'x')pad_int(int_df, 'x', start_val = 2006, end_val = 2013)int_df2 <- data.frame(x = c(2005, 2015), val = c(3, 4))pad_int(int_df2, 'x', step = 2)pad_int(int_df2, 'x', step = 5)int_df3 <- data.frame(x = c(2005, 2006, 2008, 2006, 2007, 2009),                      g = rep(LETTERS[1:2], each = 3),                      val = c(6, 6, 3, 5, 4, 3))pad_int(int_df3, 'x', group = 'g')pad_int(int_df3, 'x', group = 'g', start_val = 2005, end_val = 2009)

Span an equally spaced vector around a datetime variable

Description

Span a vector of specified interval around a variable of classDate,POSIXct, orPOSIXlt..

Usage

span_around(x, interval, start_shift = NULL, end_shift = start_shift)

Arguments

x

A vector of classDate,POSIXct, orPOSIXlt.

interval

Character, specifying the desired interval.

start_shift

Character, indicating the time to shiftback from the first observation.

end_shift

Character, indicating the time to shiftforward from the last observation.

Details

Note that use of thestart_shift andend_shiftarguments change the entire spanning when they are not in line withthe interval. It is not checked for.

Value

A datetime vector, with the first observation smaller or equal thanmin(x) and the last observation larger or equal thanmax(x).Spaces between points are equal tointerval.

Examples

span_around(coffee$time_stamp, "hour")span_around(coffee$time_stamp, "hour", end_shift = "2 hour")span_around(coffee$time_stamp, "2 day")span_around(coffee$time_stamp, "2 day", start_shift = "2 day")span_around(emergency$time_stamp, "week")span_around(emergency$time_stamp, "2 month")

Wrapper around`seq.Date`.

Description

Quickly create a sequence of dates from minimal specifications.

Usage

span_date(from, to = NULL, len_out = NULL, by = NULL)

Arguments

from

Integer or character of length 4 (yyyy), 6 (yyyymm), or 8(yyymmdd). Indicating the start value of the sequence.

to

Integer or character of length 4 (yyyy), 6 (yyyymm), or 8(yyymmdd). Optional.

len_out

The desired length of the sequence. Optional.

by

The desired interval. Optional.

Details

Minimal specification of dates, sets unspecified date parts todefault values. These are 01 for both month and day.

In addition tofrom, eitherto orlen_out must be specified.Ifby is not specified,span_date will set the interval to thehighest of the specified date parts in eitherfrom orto.For example, if they are 2011 and 2015 it will be "year", if they are 2011and 201501 it will be "month".

Value

An object of class Date.

Examples

# using "to" argumentspan_date(2011, 2015)span_date(201101, 201501)span_date(2011, 2015, by = "month")span_date(2011, 201501)span_date(20111225, 2012)# using "len_out" argumentspan_date(2011, len_out = 4)span_date(201101, len_out = 4)span_date(20110101, len_out = 4)span_date(20110101, len_out = 4, by = "month")

Wrapper around`seq.POSIXct`.

Description

Quickly create a sequence of datetimes from minimal specifications.

Usage

span_time(from, to = NULL, len_out = NULL, by = NULL, tz = "UTC")

Arguments

from

Integer or character of length 4 (yyyy), 6 (yyyymm), or 8 (yyymmdd). Character of length 11 (yyyymmdd hh), 13 (yyyymmdd hhmm), or 15 (yyyymmdd hhmmss). Indicating the start value of the sequence.

to

Integer or character of length 4 (yyyy), 6 (yyyymm), or 8 (yyymmdd). Character of length 11 (yyyymmdd hh), 13 (yyyymmdd hhmm), or 15 (yyyymmdd hhmmss). Indicating the end value of the sequence. Optional.

len_out

The desired length of the sequence. Optional.

by

The desired interval. Optional.

tz

The desired timezone.

Details

Minimal specification of datetimes, sets unspecified date parts todefault values. These are 01 for both month and day and 00 for hour, minute,and second.

In addition tofrom, eitherto orlength must be specified.If theby is not specified,span_time will set the interval tothe highest of the specified datetime parts in eitherfrom orto. For example, if they are "20160103 01" and "20160108 05" it willbe "hour", if they are "2011" and "20110101 021823" it will be "second".

Value

An object of class POSIXct.

Examples

# using tospan_time(2011, 2013)span_time("2011", "2013")span_time(2011, 201301)span_time(2011, 20130101)span_time(2011, "20110101 0023")span_time(2011, "20110101 002300")# using len_outspan_time(2011, len_out = 3)span_time("2011", len_out = 3)span_time(2011, len_out = 10, by = "month")span_time(2011, len_out = 10, by = "day")span_time(2011, len_out = 10, by = "hour")span_time("20110101 00", len_out = 10)span_time("20110101 002300", len_out = 10)

Subset a spanned datetime vector

Description

Take aDate,POSIXct, orPOSIXlt vector and subset it bya pattern of date and/or time parts.

Usage

subset_span(spanned, pattern_list)

Arguments

spanned

A vector of classDate,POSIXct, orPOSIXlt.

pattern_list

A list with the desired pattern for each of the followingdatetime parts: year, mon, mday, wday, hour, min, sec.

Details

For subsetting weekdays, they run from 0 (Sunday) to 6 (Saturday).

Value

Vector of the same class asspanned, containing all the data points inspanned that meets the requirements inpattern_list.

Examples

date_span <- span_date(20170701, len_out = 100)subset_span(date_span, list(wday = 1:5))time_span <- span_time("20170101 00", 201702)subset_span(time_span, list(hour = 7:17))subset_span(time_span, list(hour = c(10, 16), mday = seq(5, 30, 5)))

Add a variable of a higher interval to a data frame

Description

Take the datetime variable in a data frame and map thisto a variable of a higher interval. The mapping is added to the data framein a new variable.

Usage

thicken(  x,  interval,  colname = NULL,  rounding = c("down", "up"),  by = NULL,  start_val = NULL,  drop = FALSE,  ties_to_earlier = FALSE)

Arguments

x

A data frame containing at least one datetime variable ofclassDate,POSIXct orPOSIXlt.

interval

The interval of the added datetime variable.Any character string that would be accepted byseq.Date orseq.POSIXt. It can only be higher than the interval and step size ofthe input data.

colname

The column name of the added variable. IfNULL it willbe the name of the original datetime variable with the interval name added toit (including the unit), separated by underscores.

rounding

Should a value in the input datetime variable be mapped tothe closest value that is lower (down) or that is higher (up)than itself.

by

Only needs to be specified whenx contains multiplevariables of classDate,POSIXct orPOSIXlt.Indicates which to use for thickening.

start_val

By default the first instance ofinterval that is lowerthan the lowest value of the input datetime variable, with all time units ondefault value. Specifystart_val as an offset if you want the rangeto be nonstandard.

drop

Should the original datetime variable be dropped from thereturned data frame? Defaults toFALSE.

ties_to_earlier

By default when the original datetime observations istied with a value in the added datetime variable, it is assigned to thecurrent value when rounding is down or to the next value when roundingis up. WhenTRUE the ties will be assigned to the previous observationof the new variable instead.

Details

When the datetime variable contains missing values, they are leftin place in the dataframe. The added column with the new datetime variable,will have a missing values for these rows as well.

Seevignette("padr") for more information onthicken.Seevignette("padr_implementation") for detailed information ondaylight savings time, different timezones, and the implementation ofthicken.

Value

The data framex with the variable added to it.

Examples

x_hour <- seq(lubridate::ymd_hms('20160302 000000'), by = 'hour',              length.out = 200)some_df <- data.frame(x_hour = x_hour)thicken(some_df, 'week')thicken(some_df, 'month')thicken(some_df, 'day', start_val = lubridate::ymd_hms('20160301 120000'))library(dplyr)x_df <- data.frame(  x = seq(lubridate::ymd(20130101), by = 'day', length.out = 1000) %>%    sample(500),  y = runif(500, 10, 50) %>% round) %>%  arrange(x)# get the max per monthx_df %>% thicken('month') %>% group_by(x_month) %>%  summarise(y_max = max(y))# get the average per week, but you want your week to start on Mondays# instead of Sundaysx_df %>% thicken('week',                 start_val = closest_weekday(x_df$x, 2)) %>%  group_by(x_week) %>% summarise(y_avg = mean(y))# rounding up instead of downx <- data.frame(dt = lubridate::ymd_hms('20171021 160000',                                        '20171021 163100'))thicken(x, interval = "hour", rounding = "up")thicken(x, interval = "hour", rounding = "up", ties_to_earlier = TRUE)

Thicken with a custom spanning

Description

Likethicken, it will find the datetime variable inxand add a variable of a higher periodicity to it. However, the variable towhich to map the observation is provided by the user. This enables mapping totime points that are unequally spaced.

Usage

thicken_cust(x, spanned, colname, by = NULL, drop = FALSE)

Arguments

x

A data frame containing at least one datetime variable ofclassDate,POSIXct orPOSIXlt.

spanned

A datetime vector to which the the datetime variable inx should be mapped.

colname

Character, the column name of the added variable.

by

Only needs to be specified whenx contains multiplevariables of classDate,POSIXct orPOSIXlt.Indicates which to use for thickening.

drop

Should the original datetime variable be dropped from thereturned data frame? Defaults toFALSE.

Details

Only rounding down is available for custom thickening.

Value

The data framex with the variable added to it.

Examples

library(dplyr)# analysis of traffic accidents in traffic jam hours and other hours.accidents <- emergency %>% filter(title == "Traffic: VEHICLE ACCIDENT -")spanning <- span_time("20151210 16", "20161017 17", tz = "EST") %>%  subset_span(list(hour = c(6, 9, 16, 19)))thicken_cust(accidents, spanning, "period") %>%  count(period) %>%  pad_cust(spanning)