
ztils

License: MIT

Various utilities meant to speed up common statistical operations, such as:
- removing outliers and extremes
- generating probability density and cumulative distribution graphs with ggplot2
- running one-sample Kolmogorov-Smirnov tests against multiple distributions at once
- generating prediction plots with ggplot2
- scaling data and performing principal component analysis (PCA)
- plotting PCA with ggplot2

Installation

To install from CRAN:

install.packages("ztils")

To install the development version:

remotes::install_github("zachpeagler/ztils")

no_outliers()

Description

This function keeps only the rows of the dataframe whose values for the target variable lie between the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_outliers(data, var)

Arguments

data: the dataframe from which to remove rows containing outliers of the target variable
var: the variable to calculate outliers against

Returns

Returns the specified dataframe data minus the rows containing outliers in the var variable.

Examples:

no_outliers(iris, Sepal.Length)

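This isn’t a great example because Sepal.Length in the iris dataset has no values outside the 1.5 × IQR fences, so no rows are removed. Sepal.Width, by contrast, does contain outliers by this rule.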
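The fence rule described above can be sketched in base R. This is a minimal illustration, not the package's source: the helper name iqr_filter() and its k argument are hypothetical, the column is passed as a string rather than a bare name, and keeping values that lie exactly on a fence is an assumption.

```r
# Hypothetical helper: keep rows whose value lies within
# [Q1 - k * IQR, Q3 + k * IQR]. k = 1.5 mirrors the outlier rule,
# k = 3 mirrors the extremes rule.
iqr_filter <- function(data, var, k = 1.5) {
  x <- data[[var]]                              # column selected by name
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE) # first and third quartiles
  iqr <- q[[2]] - q[[1]]
  keep <- !is.na(x) & x >= q[[1]] - k * iqr & x <= q[[2]] + k * iqr
  data[keep, , drop = FALSE]
}

trimmed <- iqr_filter(iris, "Sepal.Width")
nrow(iris) - nrow(trimmed)   # number of rows dropped by the 1.5 x IQR rule
```

Calling the same helper with k = 3 applies the wider fences used for extremes.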

no_extremes()

Description

This function keeps only the rows of the dataframe whose values for the target variable lie between the first quartile minus 3.0 times the interquartile range (IQR) and the third quartile plus 3.0 times the IQR.

Usage

This function has no defaults, as it is entirely dependent on the user input.

no_extremes(data, var)

Arguments

data: the dataframe from which to remove rows containing extremes of the target variable
var: the variable to calculate extremes against

Returns

Returns the specified dataframe data minus the rows containing extremes in the var variable.

Examples:

no_extremes(iris, Sepal.Length)

This isn’t a great example because Sepal.Length in the iris dataset has no values outside the 3 × IQR fences, so no rows are removed.

multipdf_cont()

Description

This function gets the probability density function (PDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage:

multipdf_cont(var,
              seq_length = 50,
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real density of var and the probability density function of each selected distribution.

Arguments

var: the variable of which to get the PDF
- no default
seq_length: the length of the sequence to fit the distributions against
- default 50
distributions: the distributions to fit var against
- default “all”

Examples

multipdf_cont(iris$Petal.Length)
multipdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
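To make the shape of the returned dataframe concrete, here is a rough base R sketch of the same idea. It is not the package's implementation: the function name pdf_sketch(), the column names, and the parameter estimates (sample mean and standard deviation, log-scale moments, method of moments for the gamma) are all assumptions, and the package may fit the distributions differently.

```r
# Hypothetical sketch: observed density plus fitted PDFs on a common grid.
pdf_sketch <- function(x, seq_length = 50) {
  stopifnot(all(x > 0))   # log() and the gamma estimate need positive values
  obs <- density(x, n = seq_length, from = min(x), to = max(x))
  grid <- obs$x           # seq_length equally spaced points
  m <- mean(x)
  v <- var(x)
  data.frame(
    x           = grid,
    observed    = obs$y,  # kernel density estimate of the data
    normal      = dnorm(grid, mean = m, sd = sqrt(v)),
    lognormal   = dlnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
    gamma       = dgamma(grid, shape = m^2 / v, rate = m / v),
    exponential = dexp(grid, rate = 1 / m)
  )
}

head(pdf_sketch(iris$Sepal.Length))
```

Each row is one grid point, so the number of rows equals seq_length.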

multipdf_plot()

Description

This function extends multipdf_cont() and gets the probability density functions (PDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, using var_name for the plot labeling if specified. If var_name is not specified, it uses var instead.

Usage

multipdf_plot(var,
              seq_length = 50,
              distributions = "all",
              palette = "oslo",
              var_name = NULL
              )

Returns

A plot showing the PDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

var: the variable of which to get the PDF
seq_length: the length of the sequence to fit the distributions against
distributions: the distributions to fit var against
palette: A scico palette to use on the graph, with each distribution corresponding to a color. For all possible palettes, call scico_palette_names().
var_name: A name to use in the title and x-axis label of the plot.

Examples

multipdf_plot(iris$Sepal.Length)
multipdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )

multicdf_cont()

Description

This function gets the cumulative distribution function (CDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

Usage:

multicdf_cont(var,
              seq_length = 50,
              distributions = "all"
              )

Returns

This function returns a dataframe with seq_length rows containing the real cumulative density of var and the cumulative distribution function of each selected distribution.

Arguments

var: the variable of which to get the CDF
- no default
seq_length: the length of the sequence to fit the distributions against
- default 50
distributions: the distributions to fit var against
- default “all”

Examples

multicdf_cont(iris$Petal.Length)
multicdf_cont(iris$Sepal.Length,
              100,
              c("normal", "lognormal")
              )
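The CDF counterpart can be sketched the same way in base R, using ecdf() for the observed cumulative density. As before, this is only an illustration under assumptions: cdf_sketch() and its column names are hypothetical, only two distributions are shown, and the parameter estimates may differ from what the package uses.

```r
# Hypothetical sketch: empirical CDF next to fitted theoretical CDFs.
cdf_sketch <- function(x, seq_length = 50) {
  stopifnot(all(x > 0))   # the lognormal estimate takes log(x)
  grid <- seq(min(x), max(x), length.out = seq_length)
  data.frame(
    x         = grid,
    observed  = ecdf(x)(grid),   # share of observations <= each grid point
    normal    = pnorm(grid, mean = mean(x), sd = sd(x)),
    lognormal = plnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x)))
  )
}

cdf <- cdf_sketch(iris$Petal.Length, 100)
head(cdf)
```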

multicdf_plot()

Description

This function extends multicdf_cont() and gets the cumulative distribution functions (CDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots the result using ggplot2 and a scico palette, using var_name for the plot labeling if specified. If var_name is not specified, it uses var instead.

Usage

multicdf_plot(var,
              seq_length = 50,
              distributions = "all",
              palette = "oslo",
              var_name = NULL
              )

Returns

A plot showing the CDF of the selected variable against the selected distributions over the selected sequence length.

Arguments

var: the variable of which to get the CDF
seq_length: the length of the sequence to fit the distributions against
distributions: the distributions to fit var against
palette: A scico palette to use on the graph, with each distribution corresponding to a color. For all possible palettes, call scico_palette_names().
var_name: A name to use in the title and x-axis label of the plot.

Examples

multicdf_plot(iris$Sepal.Length)
multicdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)"
              )

multiks_cont()

Description

This function gets the distance and p-value from a one-sample Kolmogorov-Smirnov (KS) test for selected distributions against a continuous input variable. Possible distributions include “normal”, “lognormal”, “gamma”, “exponential”, and “all”.

Usage

multiks_cont(var,
             distributions = "all"
             )

Note: If using “lognormal” or “gamma” distributions, the target variable must be non-negative.

Arguments

var: The variable to perform one-sample KS tests on
distributions: The distributions to test against

Returns

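Returns a dataframe with the distance and p-value for each performed KS test. The distance (the KS statistic D) is the largest absolute difference between the empirical and theoretical cumulative distribution functions, so smaller values indicate a closer fit. A p-value > 0.05 indicates that the target variable’s distribution is not significantly different from the specified distribution.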

Examples

multiks_cont(iris$Sepal.Length)
multiks_cont(iris$Sepal.Length, c("normal", "lognormal"))
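For readers who want to see what a single row of that output corresponds to, the following sketch runs two of the tests directly with ks.test() from the stats package. The parameter estimates and the ks_sketch dataframe layout are assumptions rather than the package's actual code. Because the parameters are estimated from the same data being tested, the p-values should be read as approximate.

```r
x <- iris$Sepal.Length

# iris measurements contain ties, which ks.test() warns about;
# the warnings are silenced here only to keep the example short.
ks_norm <- suppressWarnings(
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
)
ks_lnorm <- suppressWarnings(
  ks.test(x, "plnorm", meanlog = mean(log(x)), sdlog = sd(log(x)))
)

# Hypothetical layout: one row per tested distribution.
ks_sketch <- data.frame(
  distribution = c("normal", "lognormal"),
  distance     = c(unname(ks_norm$statistic), unname(ks_lnorm$statistic)),
  p_value      = c(ks_norm$p.value, ks_lnorm$p.value)
)
ks_sketch
```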

glm_pseudor2()

Description

This function calculates the pseudo R^2 (an approximation of the proportion of variance explained by the model) for a generalized linear model (GLM). GLMs don’t have a true R^2 due to the intrinsic difference between a linear model and a generalized linear model, but we can still calculate an approximation of the R^2 as (1 - (deviance / null deviance)).

Usage

glm_pseudor2(mod)

Arguments

mod: The glm object to calculate a pseudo R^2 for.

Returns

Returns the pseudo R^2 value of the model.

Examples

gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
glm_pseudor2(gmod)
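The formula from the description can be checked by hand against the deviance components stored on any glm object. The helper name pseudo_r2() below is hypothetical and stands in for the package function. For a Gaussian GLM such as the one in the example, the result should coincide with the ordinary R^2 of the equivalent lm() fit.

```r
# Hypothetical stand-in: 1 - (residual deviance / null deviance).
pseudo_r2 <- function(mod) 1 - mod$deviance / mod$null.deviance

gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
lmod <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

pseudo_r2(gmod)

# With the default gaussian family the deviance is the residual sum of
# squares, so the pseudo R^2 equals the R^2 reported for the lm() fit.
all.equal(pseudo_r2(gmod), summary(lmod)$r.squared)
```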

pca_plot()

Description

This function performs a principal component analysis (PCA) for the selected pcavars with the option to automatically scale the variables. It then graphs PC1 on the x-axis and PC2 on the y-axis using ggplot2, coloring the graph with a scico palette over the specified groups. This is similar to the biplot command from the stats package, but performs all the steps required in graphing a PCA for you.

Usage

pca_plot(group,
         pcavars,
         scaled = FALSE,
         palette = "oslo"
         )

Arguments

group: The group column, used for assigning colors.
pcavars: The variables (columns) to perform a principal component analysis on. Should be explanatory variables and not response variables.
scaled: A boolean (TRUE or FALSE) indicating whether the pcavars have already been scaled or should be scaled in the function.
palette: A scico palette used to color the graph. For all possible palettes, call scico_palette_names(). If non-scico palettes are desired, the palette can be overridden with scale_color and scale_fill functions.

Returns

A ggplot object showing PC1 on the x-axis and PC2 on the y-axis, colored by group, with vectors and labels showing the individual PCA variables.

Examples

pca_plot(iris$Species, iris[,c(1:4)])
pca_plot(iris$Species, iris[,c(1:4)], FALSE, "bilbao")

pca_data()

Description

This function performs a principal component analysis (PCA) on the specified variables, pcavars, and attaches the resulting principal components to the specified dataframe, data, with optional variable scaling.

Usage

pca_data(data,
         pcavars,
         scaled = FALSE
         )

Arguments

data: The dataframe to attach principal components to.
pcavars: The variables to use in the principal component analysis.
scaled: A logical value (TRUE or FALSE) indicating whether pcavars have already been scaled or should be scaled in the function.

Returns

Returns a dataframe with principal components as additional columns.

Examples

pca_data(iris, iris[,c(1:4)], FALSE)
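The steps described above (scale, run the PCA, attach the scores) can be approximated with scale() and prcomp() from base R. This is a sketch under assumptions, not the package's source: pca_sketch() is a hypothetical name, and the PC1, PC2, ... column names are simply what prcomp() produces, which may differ from the names pca_data() uses.

```r
# Hypothetical sketch of "scale, run PCA, attach scores".
pca_sketch <- function(data, pcavars, scaled = FALSE) {
  if (!scaled) {
    pcavars <- scale(pcavars)   # centre and scale each column to unit variance
  }
  scores <- prcomp(pcavars)$x   # matrix of principal component scores
  cbind(data, scores)           # original columns plus PC1, PC2, ...
}

out <- pca_sketch(iris, iris[, 1:4])
names(out)
```

The first two score columns are the ones a PCA plot would put on the x-axis and y-axis.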

predict_plot()

Description

This function performs a prediction based on the supplied model, then graphs it using ggplot2. Options are available for predicting based on the confidence or prediction interval, as well as for applying corrections, such as exponential and logistic.

I would like to alter this function to reduce the number of required inputs, as all the information should be available from the model call, but that’s a work in progress.

Usage

predict_plot(mod,
             data,
             rvar,
             pvar,
             group = NULL,
             length = 50,
             interval = "confidence",
             correction = "normal",
             palette = "oslo"
             )

Arguments

mod: A univariate linear model to base predictions on.
data: The dataframe used in the model. Will be used to pull variables for plotting.
rvar: The response variable (y-axis), must be the same as the one in the model.
pvar: The predictor variable (x-axis), must be the same as the one in the model.
group: An optional grouping variable. If a group is present, separate predictions will be made for each group.
length: The length to predict over. A longer length will result in more precision.
interval: Tells the function to predict over either the confidence interval or the prediction interval.
- “confidence” or “prediction”
correction: If you log transform or logit transform the variables in the model, you can choose to apply a correction to the predicted output to reverse that transformation.
- “normal”, “exponential”, or “logit”
palette: A scico palette used to color the graph. For all possible palettes, call scico_palette_names(). If non-scico palettes are desired, the palette can be overridden with scale_color and scale_fill functions.

Returns

Returns a plot with the observed (real) data plotted as points and the prediction plotted as lines, with a 95% confidence or prediction interval.

This function has a known issue with the colors on ungrouped predictions being kind of funky, as the function uses the predictor variable (x-axis) for the color, which works for the actual data (points), but doesn’t translate well to the predicted lines and ribbon.

Examples

mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
predict_plot(mod1, iris, Sepal.Length, Petal.Length, Species)
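The prediction step behind such a plot can be reproduced with predict() from base R. The sketch below only builds the prediction table (fit, lower and upper bounds) and leaves the plotting out. The grid construction is an assumption: it spans the full observed range of Petal.Length for every species, whereas the package may restrict each group to its own range.

```r
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Assumed grid: 50 equally spaced predictor values for each species.
newdata <- expand.grid(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length),
                     length.out = 50),
  Species = levels(iris$Species)
)

# interval = "confidence" gives the 95% interval for the mean response;
# interval = "prediction" would give the wider interval for new observations.
pred <- cbind(newdata, predict(mod1, newdata, interval = "confidence"))
head(pred)
```

If the response had been log transformed in the model, applying exp() to the fit, lwr, and upr columns would be one way to return to the original scale.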

Bug reporting

If you find any bugs, please report them at https://github.com/zachpeagler/ztils/issues.
