Type: Package
Title: Intelligently Peruse Data
Version: 0.1.0
Maintainer: Scott McKenzie <sccmckenzie@gmail.com>
Description: Facilitate extraction of key information from common datasets.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports: pastecs (≥ 1.3.21), stats, dplyr (≥ 1.0.0), rlang (≥ 0.4.3), tidyr (≥ 1.0.0), tibble, purrr, glue, tidyselect
RoxygenNote: 7.1.1
Suggests: knitr, ggplot2, testthat (≥ 3.0.0), rmarkdown, mopac, hms, stringr, readr
Config/testthat/edition:
Depends: R (≥ 3.3.0)
VignetteBuilder: knitr
LinkingTo: cpp11
SystemRequirements: C++11
NeedsCompilation: yes
Packaged: 2021-07-03 18:58:25 UTC; Scott
Author: Scott McKenzie [aut, cre], RStudio [cph] (internal functions from dplyr.R)
Repository: CRAN
Date/Publication: 2021-07-05 09:10:02 UTC

Join tables based on overlapping intervals.

Description

User-friendly interface that combines the power of dplyr::left_join and findInterval.

Usage

break_join(x, y, brk = character(), by = NULL, ...)

Arguments

x

A data frame.

y

Data frame containing desired reference information.

brk

Name of column in x and y to join by via interval overlapping. Must be coercible to numeric.

by

Joining variables, if needed. See mutate-joins.

...

Additional arguments automatically directed to findInterval and dplyr::left_join. No partial matching.

Value

An object of the same type as x.

All x rows will be returned.
All columns from both x and y are returned.
Rows in y are matched with x based on overlapping values of brk (e.g. findInterval(x$brk, y$brk, ...)).
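
The interval matching described above can be sketched in base R. This is only an illustration of the findInterval() idea, not the package's implementation; the data frames x and y below are made up for the example, and rows that fall below every break are assumed to receive missing values.

```r
# Toy data (hypothetical, for illustration only)
x <- data.frame(q = c(1, 3.5, 5, 7.2))
y <- data.frame(q = c(3, 5), r = c("a1", "a2"), stringsAsFactors = FALSE)

# findInterval() returns, for each x$q, the index of the last break in y$q
# that is <= x$q, or 0 when x$q lies below every break.
# y$q must be sorted in increasing order.
idx <- findInterval(x$q, y$q)
idx[idx == 0] <- NA  # no matching interval: joined column stays missing

joined <- x
joined$r <- y$r[idx]
joined
```

Here the first row of x (q = 1) lies below the smallest break in y, so its r value is missing, while the remaining rows pick up the reference row whose break they most recently passed.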

Examples

# joining USA + UK leaders with population time-series
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))

# simple dataset
set.seed(1)
a <- data.frame(p = c(rep("A", 10), rep("B", 10)), q = runif(20, 0, 10))
b <- data.frame(p = c("A", "A", "B", "B"), q = c(3, 5, 6, 9), r = c("a1", "a2", "b1", "b2"))

break_join(a, b, brk = "q") # p identified as common variable automatically
break_join(a, b, brk = "q", by = "p") # same result
break_join(a, b, brk = "q", all.inside = TRUE) # note missing values have been filled

# joining toll prices with vehicle time-series
library(mopac)
library(dplyr, warn.conflicts = FALSE)
library(hms)

express %>%
  mutate(time_hms = as_hms(time)) %>%
  break_join(rates, brk = c("time_hms" = "time"))

Simulated records of radio station communications.

Description

Dataset intended to demonstrate usage of sift::conjecture.

Usage

comms

Format

An object of class tbl_df (inherits from tbl, data.frame) with 50000 rows and 4 columns.

Specialized "long to wide" reshaping

Description

On the surface, conjecture() appears similar to tidyr::pivot_wider(), but uses different logic tailored to a specific type of dataset:

column corresponding to names_from contains only 2 levels
there is no determinate combination of elements to fill 2 columns per row.

See vignette("conjecture") for more details.

Usage

conjecture(data, sort_by, names_from, names_first)

Arguments

data

A data frame to reshape.

sort_by

Column name, as symbol. Plays a role similar to values_from in pivot_wider(), but also serves as the sorting dimension for the underlying conjecture algorithm.

names_from

Column name, as symbol. Used to differentiate anterior/posterior observations. Column must contain exactly 2 levels (missing values not allowed).

names_first

Level in variable specified by names_from indicating anterior observation.

Details

conjecture() uses the following routine to match elements:

Values in sort_by are separated into two vectors: anterior and posterior.
Each anterior element is matched with the closest posterior element measured by sort_by.
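
A rough base R sketch of this routine follows. It assumes that "closest" means the smallest posterior value that is not less than the anterior value; that reading is an assumption made for illustration and may not match the package's exact rule (see vignette("conjecture")). The vectors anterior and posterior are made-up data.

```r
# Toy values of sort_by, already split by names_from (hypothetical data)
anterior  <- c(1, 4, 9)
posterior <- c(2, 6, 7, 15)

# Assumption: match each anterior value to the nearest posterior value
# at or after it; NA when no such posterior value exists.
matched <- vapply(anterior, function(a) {
  later <- posterior[posterior >= a]
  if (length(later) == 0) NA_real_ else min(later)
}, numeric(1))

data.frame(anterior = anterior, posterior = matched)
```

Note that the posterior value 7 is left unmatched in this toy example, which reflects the point that not every element has a determinate partner.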

Value

An object of the same type as data.

Examples

# See vignette("conjecture") for more examples
conjecture(comms, timestamp, type, "send")

Automatically cluster 1-dimensional continuous data.

Description

Automatically cluster 1-dimensional continuous data.

Usage

kluster(x, bw = "SJ", fixed = FALSE)

Arguments

x

Vector to be clustered. Must contain at least 1 non-missing value.

bw

Kernel bandwidth. The default "SJ" should suffice for most applications; however, you can supply a custom numeric value. See ?stats::density for more information.

fixed

Logical; if TRUE, performs simple 1-dimensional clustering without kernel density estimation. Default FALSE.

Value

An integer vector identifying the cluster corresponding to each element in x.
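
One plausible reading of "simple 1-dimensional clustering" is gap-based grouping: sort the values and start a new cluster whenever the gap to the previous value exceeds a fixed width. The sketch below illustrates that idea only; gap_cluster is a hypothetical helper, not a function in sift, and it is not claimed to reproduce kluster(fixed = TRUE). It assumes x has no missing values.

```r
# Hypothetical helper (not part of sift): gap-based 1-D clustering.
gap_cluster <- function(x, bw) {
  ord <- order(x)
  # A new cluster starts at the first value and after every gap wider than bw
  new_cluster <- c(TRUE, diff(x[ord]) > bw)
  out <- integer(length(x))
  out[ord] <- cumsum(new_cluster)  # return labels in the original order of x
  out
}

gap_cluster(c(0.1, 0.2, 1), bw = 0.5)
```

As with kluster(), the result is an integer vector with one cluster label per element of the input, in the input's original order.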

Examples

# Below vector clearly has 2 groups.
# kluster will identify these groups using kernel density estimation.
kluster(c(0.1, 0.2, 1))

# kluster shines in cases where manually assigning groups via "eyeballing" is impractical.
# Suppose we obtained vector 'x' without knowing how it was generated.
set.seed(1)
nodes <- runif(10, min = 0, max = 100)
x <- lapply(nodes, function(x) rnorm(10, mean = x, sd = 0.1))
x <- unlist(x)

kluster(x) # kluster reveals the natural grouping
kluster(x, bw = 10) # adjust bandwidth depending on application

# Example with faithful dataset
faithful$k <- kluster(faithful$eruptions)

library(ggplot2)

ggplot(faithful, aes(eruptions)) +
  geom_density() +
  geom_rug(aes(color = factor(k))) +
  theme_minimal() +
  scale_color_discrete(name = "k")

2020 New York Times Headlines

Description

Includes selected headlines and additional metadata for NYT articles throughout 2020. This dataset is not a comprehensive account of all major events from 2020.

Usage

nyt2020

Format

A data frame with 1,830 rows and 6 variables:

headline: Article Headline
abstract: Brief summary of article
byline: Contributing Writers
pub_date: Date of Publication
section_name: NYT section in which article was published
web_url: Article URL

...

Source

Obtained using NYT Developer Portal (Archive API)

Augmented data frame filtering.

Description

Imagine dplyr::filter that includes neighboring observations. Choose how many observations to include by adjusting inputs sift.col and scope.

Usage

sift(.data, sift.col, scope, ...)

Arguments

.data

A data frame.

sift.col

Column name, as symbol, to serve as "sifting/augmenting" dimension. Must be non-missing and coercible to numeric.

scope

Specifies augmentation bandwidth relative to "key" observations. Parameter should share the same scale as sift.col.

If length 1, bandwidth used is +/-scope.

If length 2, bandwidth used is (-scope[1], +scope[2]).

...

Expressions passed to dplyr::filter, of which the results serve as the "key" observations. The same data-masking rules used in dplyr::filter apply here.

Details

sift() can be understood as a 2-step process:

.data is passed to dplyr::filter, using subsetting expression(s) provided in .... We'll refer to these intermediate results as "key" observations.
For each key observation, sift expands the row selection bidirectionally along the dimension specified by sift.col. Any row from the original dataset within scope units of a key observation is captured in the final result.

Essentially, this allows us to "peek" at neighboring rows surrounding the key observations.
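
The two steps can be mimicked in base R as a rough sketch. The data frame df, its columns t and flag, and the inclusive window boundaries are assumptions made for illustration; the sketch does not compute .cluster and is not the package's implementation.

```r
# Toy data (hypothetical): t plays the role of sift.col
df <- data.frame(
  t    = c(1, 2, 5, 6, 10, 20),
  flag = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
scope <- c(1, 4)  # length 2: look 1 unit back and 4 units ahead

# Step 1: the filtering expression identifies the key observations
key <- df$flag
key_vals <- df$t[key]

# Step 2: keep every row that falls inside the window of any key observation
# (assumption: window boundaries are inclusive)
in_window <- vapply(df$t, function(v) {
  any(v >= key_vals - scope[1] & v <= key_vals + scope[2])
}, logical(1))

result <- df[in_window, ]
result$.key <- key[in_window]
result
```

With a single key observation at t = 5, the window covers 4 through 9, so the rows at t = 5 and t = 6 are kept and only the first is marked as a key observation.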

Value

A sifted data frame, with 2 additional columns:

.cluster <int>: Identifies resulting group formed by each key observation and its neighboring rows. When the key observations are close enough together, the clusters will overlap.
.key <lgl>: TRUE indicates key observation.

Examples

# See current events from same timeframe as 2020 Utah Monolith discovery.
sift(nyt2020, pub_date, scope = 2, grepl("Monolith", headline))

# or Biden's presidential victory.
sift(nyt2020, pub_date, scope = 2, grepl("Biden is elected", headline))

# We can specify lower & upper scope to see what happened AFTER Trump tested positive.
sift(nyt2020, pub_date, scope = c(0, 2), grepl("Trump Tests Positive", headline))

# sift recognizes dplyr group specification.
library(dplyr)
library(mopac)

express %>%
  group_by(direction) %>%
  sift(time, 30, plate == "EAS-1671") # row augmentation performed within groups.

Fragments of US & UK population & leaders

Description

These datasets are intended to demonstrate usage of sift::break_join.

Usage

us_uk_pop

us_uk_leaders

Source

See tidyr::who and ggplot2::presidential.