| Title: | Summarise Continuous, Date and Categorical Variables, Check for Duplicates and Missing Data |
| Version: | 0.1 |
| Description: | Explore continuous, date and categorical variables. 'sumvar' aims to bring the ease and simplicity of the "sum" and "tab" functions from 'Stata'. |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| Imports: | dplyr, ggplot2, lubridate, magrittr, patchwork, purrr, rlang, scales, stats, tibble, tidyr, utils |
| Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| URL: | https://github.com/alstockdale/sumvar, https://alstockdale.github.io/sumvar/ |
| BugReports: | https://github.com/alstockdale/sumvar/issues |
| License: | MIT + file LICENSE |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-06-11 17:14:46 UTC; al_st |
| Author: | Alexander Stockdale [aut, cre] |
| Maintainer: | Alexander Stockdale <a.stockdale@liverpool.ac.uk> |
| Repository: | CRAN |
| Date/Publication: | 2025-06-13 20:00:02 UTC |
sumvar: Summarise Continuous and Categorical Variables in R
Description
The sumvar package explores continuous and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" functions from Stata to R.
To explore a continuous variable, use dist_sum(). You can stratify by a grouping variable: df %>% dist_sum(var, group)
To explore dates, use dist_date(); usage is the same as dist_sum().
To summarise a single categorical variable, use tab1(), e.g. df %>% tab1(var). For a two-way table, use tab(), e.g. df %>% tab(var1, var2). Both include options for frequentist hypothesis tests.
Explore duplicates and missing values with dup().
All functions are tidyverse/dplyr-friendly and accept the %>% pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var).
Author(s)
Maintainer: Alexander Stockdale <a.stockdale@liverpool.ac.uk>
See Also
Useful links:
Report bugs at https://github.com/alstockdale/sumvar/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs | A value or the magrittr placeholder. |
rhs | A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Summarize and visualize a date variable
Description
Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.
Usage
dist_date(data, var, by = NULL)
Arguments
data | A data frame or tibble. |
var | The date variable to summarise. |
by | Optional grouping variable. |
Value
A tibble with summary statistics for the date variable.
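The statistics listed above can be reproduced by hand with base R, which is useful for checking results. The following is a minimal sketch, not the internal code of dist_date(): the helper name date_summary and the vector dt are hypothetical, and the quartile method (R's default, type 7) is an assumption.

```r
# Hypothetical helper: summary statistics for a Date vector using base R only.
date_summary <- function(dt) {
  days <- as.numeric(dt)  # days since 1970-01-01
  q <- stats::quantile(days, probs = c(0.25, 0.5, 0.75),
                       na.rm = TRUE, names = FALSE)
  to_date <- function(x) as.Date(x, origin = "1970-01-01")
  data.frame(
    n      = sum(!is.na(dt)),
    n_miss = sum(is.na(dt)),
    min    = min(dt, na.rm = TRUE),
    q25    = to_date(q[1]),
    median = to_date(q[2]),
    q75    = to_date(q[3]),
    max    = max(dt, na.rm = TRUE)
  )
}

# Five dates spaced ten days apart, so the quartiles fall on observed dates.
dt <- as.Date("2020-01-01") + c(0, 10, 20, 30, 40)
res <- date_summary(dt)
print(res)
```

The quartiles are computed on the underlying day counts and converted back to dates, which avoids relying on how a given R version handles quantiles of Date objects.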
See Also
dist_sum for continuous variables.
Examples
# Example ungrouped
df <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
)
dist_date(df, dt)
# Example grouped
df2 <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
  grp = sample(1:2, 100, TRUE)
)
dist_date(df2, dt, grp)
# Note this function accepts a pipe from dplyr, e.g. df %>% dist_date(date_var, group_var)
Explore a continuous variable
Description
Summarises the median, interquartile range, mean, standard deviation and confidence intervals of the mean, and produces a density plot, optionally stratified by a grouping variable.
Provides frequentist hypothesis tests for comparison between groups: t-test and Wilcoxon rank sum test for 2 groups, ANOVA and Kruskal-Wallis test for 3 or more groups.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
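The group comparisons named above correspond to standard functions in the stats package. The sketch below shows one way to select them by the number of groups; it is an illustration, not the implementation of dist_sum(). The helper name compare_groups is hypothetical, and using R's default Welch t-test is an assumption, since the description does not say which t-test variant is applied.

```r
# Hypothetical helper: choose tests by the number of groups (stats package only).
compare_groups <- function(x, g) {
  g <- factor(g)
  k <- nlevels(g)
  if (k == 2) {
    c(p_ttest  = stats::t.test(x ~ g)$p.value,       # Welch t-test (R default)
      p_wilcox = stats::wilcox.test(x ~ g)$p.value)  # Wilcoxon rank sum test
  } else if (k >= 3) {
    fit <- stats::aov(x ~ g)                         # one-way ANOVA
    c(p_anova   = summary(fit)[[1]][["Pr(>F)"]][1],
      p_kruskal = stats::kruskal.test(x ~ g)$p.value)
  } else {
    stop("need at least two groups")
  }
}

set.seed(1)
age <- rnorm(90, mean = 30, sd = 10)
two   <- compare_groups(age, rep(c("male", "female"), each = 45))
three <- compare_groups(age, rep(c("a", "b", "c"), each = 30))
print(two)
print(three)
```

The returned names mirror the p_ttest, p_wilcox, p_anova and p_kruskal columns described for dist_sum(), but the values may differ if the package uses other test options.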
Usage
dist_sum(data, var, by = NULL)
Arguments
data | The data frame or tibble |
var | The variable you would like to summarise |
by | The grouping variable (optional) |
Value
A tibble with a summary of the variable frequency (n), number of missing observations (n_miss), median, interquartile range, mean, SD, 95% confidence intervals of the mean (using the Z distribution), and density plots.
Shows the t-test (p_ttest) and Wilcoxon rank sum (p_wilcox) hypothesis tests when there are two groups, and an ANOVA test (p_anova) and Kruskal-Wallis test (p_kruskal) when there are three or more groups.
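A 95% confidence interval of the mean based on the Z distribution is conventionally the mean plus or minus about 1.96 standard errors, where the standard error is SD divided by the square root of n. The sketch below computes it in base R; the helper name mean_ci_z is hypothetical, and the exact formula used inside dist_sum() is assumed rather than confirmed.

```r
# Hypothetical helper: mean with a normal-approximation (Z) confidence interval.
mean_ci_z <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                         # drop missing observations
  n <- length(x)
  m <- mean(x)
  s <- stats::sd(x)
  se <- s / sqrt(n)                         # standard error of the mean
  z <- stats::qnorm(1 - (1 - level) / 2)    # about 1.96 for a 95% interval
  c(n = n, mean = m, sd = s, ci_lower = m - z * se, ci_upper = m + z * se)
}

x <- c(21, 25, 30, 34, 40, NA)
ci <- mean_ci_z(x)
print(round(ci, 2))
```

For small samples a t-based interval is wider than this Z-based one, which is worth keeping in mind when comparing against output from t.test().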
Examples
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                              group = sample(c("a", "b", "c", "d"),
                              size = 100, replace = TRUE))
dist_sum(example_data, age, group)
example_data <- dplyr::tibble(id = 1:100, age = rnorm(100, mean = 30, sd = 10),
                              sex = sample(c("male", "female"),
                              size = 100, replace = TRUE))
dist_sum(example_data, age, sex)
summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.
Explore duplicate and missing data
Description
Provides an integer value for the number of duplicates found within a variable.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
e.g. example_data %>% dup(variable)
Usage
dup(data, var = NULL)
Arguments
data | The data frame or tibble |
var | The variable to assess |
Value
A tibble with the number and percentage of duplicate values found, and the number of missing values (NA), together with percentages.
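To make the counts concrete, the following base R sketch tallies duplicated and missing values for a single vector. It is only an approximation of what dup() reports: the helper name dup_summary and its column names are hypothetical, and both the definition of a duplicate (here, the second and later occurrences of a non-missing value) and the percentage denominator (here, all observations) are assumptions.

```r
# Hypothetical helper: count duplicated and missing values in a vector.
dup_summary <- function(x) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  # duplicated() flags the second and later occurrences of each value;
  # missing values are excluded first so NA is not counted as a duplicate.
  n_dup <- sum(duplicated(x[!is.na(x)]))
  data.frame(
    n             = n,
    n_duplicate   = n_dup,
    pct_duplicate = round(100 * n_dup / n, 1),
    n_missing     = n_missing,
    pct_missing   = round(100 * n_missing / n, 1)
  )
}

age <- c(30, 30, 41, 52, 52, 52, NA, NA, 18, 27)
out <- dup_summary(age)
print(out)
```

Here one repeat of 30 and two repeats of 52 give three duplicates out of ten observations, alongside two missing values.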
Examples
example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
dup(example_data, age)
# It is also possible to pass a whole data frame to dup and it will explore all variables.
example_data <- dplyr::tibble(age = round(rnorm(200, mean = 30, sd = 50), digits=0),
                              sex = sample(c("Male", "Female"), 200, TRUE),
                              favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
dup(example_data)
Create a cross-tabulation of two categorical variables
Description
Creates a "n x n" cross-tabulation of two categorical variables, with row percentages. Includes options for adding frequentist hypothesis testing.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
e.g. example_data %>% tab(variable1, variable2)
Usage
tab(data, variable1, variable2, test = "none")
Arguments
data | The data frame or tibble |
variable1 | The first categorical variable |
variable2 | The second categorical variable |
test | Optional frequentist hypothesis test: use test = "exact" for Fisher's exact test or test = "chi" for the Chi-squared test (default "none") |
Value
A tibble with a cross-tabulation of frequencies and row percentages
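The frequencies, row percentages and the two optional tests can be cross-checked with base R. The sketch below uses table(), prop.table(), fisher.test() and chisq.test() on a small hand-made data set; it illustrates the quantities involved and is not the code behind tab(). The example vectors are made up, and whether tab() applies a continuity correction in the Chi-squared test is not stated, so R's default is assumed.

```r
# Cross-tabulation with row percentages and optional tests, base R only.
group1 <- c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b")
group2 <- c("male", "male", "male", "female",
            "male", "female", "female", "female", "female", "female")

counts  <- table(group1, group2)
row_pct <- round(100 * prop.table(counts, margin = 1), 1)  # each row sums to 100
print(counts)
print(row_pct)

# Fisher's exact test and Chi-squared test on the same table.
p_exact <- stats::fisher.test(counts)$p.value
# Small expected counts trigger a warning from chisq.test(); silenced for the demo.
p_chi <- suppressWarnings(stats::chisq.test(counts)$p.value)
print(c(p_exact = p_exact, p_chi = p_chi))
```

With counts this small, Fisher's exact test is usually the more appropriate of the two.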
Examples
example_data <- dplyr::tibble(id = 1:100, group1 = sample(c("a", "b", "c", "d"),
                                                   size = 100, replace = TRUE),
                              group2 = sample(c("male", "female"),
                                              size = 100, replace = TRUE))
example_data$group1[sample(1:100, size = 10)] <- NA # Replace 10 with missing
tab(example_data, group1, group2)
summary <- tab(example_data, group1, group2) # Save summary statistics as a tibble.
Summarise a categorical variable
Description
Summarises frequencies and percentages for a categorical variable.
The function accepts an input from a dplyr pipe "%>%" and outputs the results as a tibble.
e.g. example_data %>% tab1(variable)
Usage
tab1(data, variable, dp = 1)
Arguments
data | The data frame or tibble |
variable | The categorical variable you would like to summarise |
dp | The number of decimal places for percentages (default = 1, as shown in Usage) |
Value
A tibble with frequencies and percentages
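A one-way frequency table of this kind is straightforward to reproduce in base R. The sketch below is illustrative only: the helper name freq_table and its column names are hypothetical, and treating missing values as their own category with all observations as the denominator is an assumption about how tab1() behaves.

```r
# Hypothetical helper: one-way frequency table with percentages rounded to dp places.
freq_table <- function(x, dp = 1) {
  counts <- table(x, useNA = "ifany")  # keep NA as its own category when present
  data.frame(
    category = names(counts),
    n        = as.integer(counts),
    percent  = round(100 * as.integer(counts) / length(x), dp)
  )
}

group <- c("a", "a", "b", "c", "c", "c", NA, "a")
ft <- freq_table(group, dp = 1)
print(ft)
```

Changing dp only affects the rounding of the percent column, which matches the role described for the dp argument.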
Examples
example_data <- dplyr::tibble(id = 1:100, group = sample(c("a", "b", "c", "d"),
                                                  size = 100, replace = TRUE))
example_data$group[sample(1:100, size = 10)] <- NA # Replace 10 with missing
tab1(example_data, group)
summary <- tab1(example_data, group) # Save summary statistics as a tibble.