Various utilities meant to aid in speeding up common statistical operations, such as:

- removing outliers and extremes
- generating probability density and cumulative distribution graphs with ggplot2
- running one-sample Kolmogorov-Smirnov tests against multiple distributions at once
- generating prediction plots with ggplot2
- scaling data and performing principal component analysis (PCA)
- plotting PCA with ggplot2

To install from CRAN:

```r
install.packages("ztils")
```

To install the development version:

```r
remotes::install_github("zachpeagler/ztils")
```

## no_outliers

This function works by keeping only the rows of the dataframe whose variable values lie between the first quartile minus 1.5 times the interquartile range (IQR) and the third quartile plus 1.5 times the IQR.

This function has no defaults, as it is entirely dependent on the user input.

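The rule described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name `iqr_keep` and its `k` argument are hypothetical, the column is passed as a string rather than a bare name, and quartiles are assumed to come from the `quantile()` defaults.

```r
# Hypothetical helper (not part of ztils): keep rows whose value of `var`
# lies within [Q1 - k * IQR, Q3 + k * IQR].
iqr_keep <- function(data, var, k = 1.5) {
  x <- data[[var]]
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)  # first and third quartiles
  spread <- q[[2]] - q[[1]]                      # interquartile range
  keep <- !is.na(x) & x >= q[[1]] - k * spread & x <= q[[2]] + k * spread
  data[keep, , drop = FALSE]
}

# Sepal.Width is used here because it has values outside the 1.5 * IQR fences.
trimmed <- iqr_keep(iris, "Sepal.Width")
nrow(iris) - nrow(trimmed)  # number of rows dropped
```
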
### Usage

```r
no_outliers(data, var)
```

### Returns

Returns the specified dataframe `data` minus the rows containing outliers in the `var` variable.

### Examples

```r
no_outliers(iris, Sepal.Length)
```

This isn’t a great example because `Sepal.Length` in the iris dataset does not contain any statistical outliers.

## no_extremes

This function works by keeping only the rows of the dataframe whose variable values lie between the first quartile minus 3.0 times the interquartile range (IQR) and the third quartile plus 3.0 times the IQR.

This function has no defaults, as it is entirely dependent on the user input.

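To illustrate how the 3.0 multiplier differs from a 1.5 multiplier, the sketch below counts the values of `mtcars$hp` that fall beyond each pair of fences. The helper `n_beyond` is hypothetical and only applies the rule as described here; it is not how ztils is implemented.

```r
# Hypothetical helper: count values outside [Q1 - k * IQR, Q3 + k * IQR].
n_beyond <- function(x, k) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  sum(x < q[[1]] - k * spread | x > q[[2]] + k * spread, na.rm = TRUE)
}

n_outliers <- n_beyond(mtcars$hp, 1.5)  # beyond the narrower fences
n_extremes <- n_beyond(mtcars$hp, 3.0)  # beyond the wider fences
c(outliers = n_outliers, extremes = n_extremes)
```

Because the 3.0 fences always enclose the 1.5 fences, a value can be an outlier without being an extreme, but not the other way around.
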
### Usage

```r
no_extremes(data, var)
```

### Returns

Returns the specified dataframe `data` minus the rows containing extremes in the `var` variable.

### Examples

```r
no_extremes(iris, Sepal.Length)
```

This isn’t a great example because `Sepal.Length` in the iris dataset does not contain any statistical extremes.

## multipdf_cont

This function gets the probability density function (PDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

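As a rough illustration of comparing a variable's observed density with fitted distribution densities, the base R sketch below estimates parameters with simple sample formulas and evaluates each PDF over an evenly spaced sequence. It is a sketch under assumptions rather than the package's method: ztils may fit parameters differently, the gamma distribution is left out because it needs a numerical fit, and the column names are made up for this example.

```r
x <- iris$Petal.Length
seq_length <- 50

# Evenly spaced grid spanning the observed range
grid <- seq(min(x), max(x), length.out = seq_length)

# Kernel density estimate of the data, interpolated onto the grid
d <- density(x)

pdf_df <- data.frame(
  value = grid,
  observed = approx(d$x, d$y, xout = grid)$y,
  normal = dnorm(grid, mean = mean(x), sd = sd(x)),
  # log() is why negative inputs cannot be used with the lognormal
  lognormal = dlnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
  exponential = dexp(grid, rate = 1 / mean(x))
)
head(pdf_df)
```
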
### Usage

```r
multipdf_cont(var, seq_length = 50, distributions = "all")
```

### Returns

This function returns a dataframe with `seq_length` rows containing the real density and the probability density function of `var` for the selected distributions.

### Examples

```r
multipdf_cont(iris$Petal.Length)
multipdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
```

## multipdf_plot

This function extends `multipdf_cont` and gets the probability density functions (PDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots this using `ggplot2` and a `scico` palette, using `var_name` for the plot labeling, if specified. If not specified, it will use `var` instead.

### Usage

```r
multipdf_plot(var,
              seq_length = 50,
              distributions = "all",
              palette = "oslo",
              var_name = NULL)
```

### Returns

A plot showing the PDF of the selected variable against the selected distributions over the selected sequence length.

### Examples

```r
multipdf_plot(iris$Sepal.Length)
multipdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)")
```

## multicdf_cont

This function gets the cumulative distribution function (CDF) for selected distributions against continuous variables. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions).

Note that only non-negative numbers are supported by the lognormal and gamma distributions. Feeding this function a negative number with those distributions selected will result in an error.

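Comparing an observed cumulative distribution with theoretical CDFs can be sketched in base R using `ecdf()` for the data. This is only an illustration with assumed parameter estimates and made-up column names, not the package's implementation, and it omits the gamma distribution because that needs a numerical fit.

```r
x <- iris$Petal.Length
grid <- seq(min(x), max(x), length.out = 50)

cdf_df <- data.frame(
  value = grid,
  observed = ecdf(x)(grid),  # empirical CDF evaluated on the grid
  normal = pnorm(grid, mean = mean(x), sd = sd(x)),
  lognormal = plnorm(grid, meanlog = mean(log(x)), sdlog = sd(log(x))),
  exponential = pexp(grid, rate = 1 / mean(x))
)
tail(cdf_df)
```
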
### Usage

```r
multicdf_cont(var, seq_length = 50, distributions = "all")
```

### Returns

This function returns a dataframe with `seq_length` rows containing the real cumulative distribution and the cumulative distribution function of `var` for the selected distributions.

### Examples

```r
multicdf_cont(iris$Petal.Length)
multicdf_cont(iris$Sepal.Length, 100, c("normal", "lognormal"))
```

## multicdf_plot

This function extends `multicdf_cont` and gets the cumulative distribution functions (CDFs) for selected distributions against continuous, non-negative numbers. Possible distributions include any combination of “normal”, “lognormal”, “gamma”, “exponential”, and “all” (which just uses all of the prior distributions). It then plots this using `ggplot2` and a `scico` palette, using `var_name` for the plot labeling, if specified. If not specified, it will use `var` instead.

### Usage

```r
multicdf_plot(var,
              seq_length = 50,
              distributions = "all",
              palette = "oslo",
              var_name = NULL)
```

### Returns

A plot showing the CDF of the selected variable against the selected distributions over the selected sequence length.

### Examples

```r
multicdf_plot(iris$Sepal.Length)
multicdf_plot(iris$Sepal.Length,
              seq_length = 100,
              distributions = c("normal", "lognormal", "gamma"),
              palette = "bilbao",
              var_name = "Sepal Length (cm)")
```

## multiks_cont

This function gets the distance and p-value from a one-sample Kolmogorov-Smirnov (KS) test for selected distributions against a continuous input variable. Possible distributions include “normal”, “lognormal”, “gamma”, “exponential”, and “all”.

### Usage

```r
multiks_cont(var, distributions = "all")
```

Note: If using “lognormal” or “gamma” distributions, the target variable must be non-negative.

### Returns

Returns a dataframe with the distance and p-value for each performed KS test. The distance (the KS statistic D) is the largest absolute difference between the empirical CDF of the variable and the CDF of the specified distribution, so smaller distances indicate a closer match. A p-value > 0.05 indicates that the target variable’s distribution is not significantly different from the specified distribution.

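For readers who want to see what running several one-sample KS tests at once might look like, here is a sketch built on `stats::ks.test()`. It assumes parameters estimated directly from the data and uses made-up column names; the package may estimate parameters and shape its output differently.

```r
x <- iris$Sepal.Length

# One-sample KS tests against distributions with parameters estimated from x.
# suppressWarnings() hides the warning ks.test() gives when values are tied.
tests <- list(
  normal = suppressWarnings(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))),
  lognormal = suppressWarnings(
    ks.test(x, "plnorm", meanlog = mean(log(x)), sdlog = sd(log(x)))
  ),
  exponential = suppressWarnings(ks.test(x, "pexp", rate = 1 / mean(x)))
)

# Collect the distance (D statistic) and p-value of each test
ks_df <- data.frame(
  distribution = names(tests),
  distance = vapply(tests, function(t) unname(t$statistic), numeric(1)),
  p_value = vapply(tests, function(t) t$p.value, numeric(1)),
  row.names = NULL
)
ks_df
```

Because the parameters here are estimated from the same data being tested, the p-values should be read as approximate.
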
### Examples

```r
multiks_cont(iris$Sepal.Length)
multiks_cont(iris$Sepal.Length, c("normal", "lognormal"))
```

## glm_pseudor2

This function calculates the pseudo R^2 (proportion of variance explained by the model) for a generalized linear model (GLM). GLMs don’t have a true R^2 due to the intrinsic difference between a linear model and a generalized linear model, but we can still calculate an approximation of the R^2 as (1 - (deviance / null deviance)).

### Usage

```r
glm_pseudor2(mod)
```

### Returns

Returns the pseudo R^2 value of the model.

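The formula (1 - (deviance / null deviance)) can be checked by hand from the components stored on a fitted `glm` object. The sketch below does not call ztils; it only shows the arithmetic, and the variable names are chosen for this example.

```r
# Gaussian GLM on iris; for this family the deviance is the residual sum of squares
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Pseudo R^2 from the deviance and null deviance stored on the fitted model
pseudo_r2 <- 1 - gmod$deviance / gmod$null.deviance

# For a Gaussian GLM with identity link this matches the R^2 of the equivalent lm()
lmod <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
all.equal(pseudo_r2, summary(lmod)$r.squared)
```

For non-Gaussian families there is no such exact equivalence, which is why the value is only a pseudo R^2.
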
### Examples

```r
gmod <- glm(Sepal.Length ~ Petal.Length + Species, data = iris)
glm_pseudor2(gmod)
```

## pca_plot

This function performs a principal component analysis (PCA) for the selected `pcavars` with the option to automatically scale the variables. It then graphs PC1 on the x-axis and PC2 on the y-axis using `ggplot2`, coloring the graph with a `scico` palette over the specified `group`. This is similar to the `biplot` command from the `stats` package, but performs all the steps required in graphing a PCA for you.

### Usage

```r
pca_plot(group, pcavars, scaled = FALSE, palette = "oslo")
```

### Returns

A ggplot object showing PC1 on the x-axis and PC2 on the y-axis, colored by group, with vectors and labels showing the individual PCA variables.

### Examples

```r
pca_plot(iris$Species, iris[, c(1:4)])
pca_plot(iris$Species, iris[, c(1:4)], FALSE, "bilbao")
```

## pca_data

This function performs a principal component analysis (PCA) on the specified variables, `pcavars`, and attaches the resulting principal components to the specified dataframe, `data`, with optional variable scaling.

### Usage

```r
pca_data(data, pcavars, scaled = FALSE)
```

### Returns

Returns a dataframe with principal components as additional columns.

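The idea of attaching principal components to a dataframe can be sketched with base R's `prcomp()`. Whether ztils uses `prcomp()` internally is an assumption here, and the object names below are chosen for illustration only.

```r
pcavars <- iris[, 1:4]

# scale. = TRUE standardizes each variable to unit variance before the PCA
pca <- prcomp(pcavars, center = TRUE, scale. = TRUE)

# Attach the principal component scores (PC1..PC4) as extra columns
iris_pca <- cbind(iris, pca$x)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```

Scaling matters when variables are measured in different units, since otherwise the variables with the largest variance dominate the first components.
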
### Examples

```r
pca_data(iris, iris[, c(1:4)], FALSE)
```

## predict_plot

This function performs a prediction based on the supplied model, then graphs it using `ggplot2`. Options are available for predicting based on the confidence or prediction interval, as well as for applying corrections, such as exponential and logistic.

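The difference between the confidence and prediction interval options can be sketched with base R's `predict()` on a linear model. This is not the package's code: the object names are made up, and fixing the group at "versicolor" is an arbitrary choice for the example.

```r
mod <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# New data: an evenly spaced Petal.Length sequence for one group
newdata <- data.frame(
  Petal.Length = seq(min(iris$Petal.Length), max(iris$Petal.Length),
                     length.out = 50),
  Species = factor("versicolor", levels = levels(iris$Species))
)

# 95% intervals: "confidence" covers the mean response,
# "prediction" covers individual new observations
conf_int <- predict(mod, newdata, interval = "confidence", level = 0.95)
pred_int <- predict(mod, newdata, interval = "prediction", level = 0.95)

head(cbind(newdata, conf_int))
```

The prediction interval is always the wider of the two, because it adds the residual variance to the uncertainty in the fitted mean.
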
I would like to alter this function to reduce the number of required inputs, as all the information should be available from the model call, but that’s a work in progress.

### Usage

```r
predict_plot(mod,
             data,
             rvar,
             pvar,
             group = NULL,
             length = 50,
             interval = "confidence",
             correction = "normal",
             palette = "oslo")
```

### Returns

Returns a plot with the observed (real) data plotted as points and the prediction plotted as lines, with a 95% confidence or prediction interval.

This function has a known issue: the colors on ungrouped predictions can look odd, because the function uses the predictor variable (x-axis) for the color. That works for the actual data (points), but doesn’t translate well to the predicted lines and ribbon.

### Examples

```r
mod1 <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
predict_plot(mod1, iris, Sepal.Length, Petal.Length, Species)
```

If you find any bugs, please report them at https://github.com/zachpeagler/ztils/issues.