How to use the volkeR package?
First, load the package, set the plot theme and get some data.
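
    # Load the package
    library(volker)

    # theme_set() comes from ggplot2; filter() and mutate(), used in later examples, come from dplyr
    library(ggplot2)
    library(dplyr)

    # Set the basic plot theme
    theme_set(theme_vlkr())

    # Load an example dataset ds from the package
    ds <- volker::chatgpt

How to generate tables and plots?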
Decide whether your data is categorical or metric and choose the appropriate function:

- report_counts() shows frequency tables and generates simple and stacked bar charts.
- report_metrics() creates tables with distribution parameters and visualises distributions in density plots, box plots or scatter plots.

Report functions, under the hood, call functions that generate plots, tables or calculate effects. If you only need one of those outputs, you can call the functions directly:

- tab_counts(), plot_counts() or effect_counts() for categorical data.
- tab_metrics(), plot_metrics() or effect_metrics() for metric data.

All functions expect a dataset as their first parameter. The second and third parameters take your column selections. The column selections determine whether to analyse single variables or item lists, or to compare and correlate multiple variables.
Try out the following examples!
Categorical variables

    # A single variable
    report_counts(ds, use_private)

    # A list of variables
    report_counts(ds, c(use_private, use_work))

    # Variables matched by a pattern
    report_counts(ds, starts_with("use_"))

You can use all sorts of tidyverse-style selections: a single column, a list of columns or patterns such as starts_with(), ends_with(), contains() or matches().
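
To get a feel for which columns each pattern helper picks, the following base-R sketch mimics the matching on a toy data frame. The data frame and its column names are hypothetical, and the real helpers come from tidyselect (which, unlike this sketch, ignores case by default), so treat this as an approximation rather than the actual selection mechanism.

```r
# Toy data frame; the column names are hypothetical
df <- data.frame(
  use_private  = 1:3,
  use_work     = 3:1,
  sd_age       = c(25, 40, 31),
  cg_use_total = c(2, 2, 4)
)
cols <- names(df)

# Roughly what starts_with("use_") selects
sel_starts <- cols[startsWith(cols, "use_")]

# Roughly what ends_with("_work") selects
sel_ends <- cols[endsWith(cols, "_work")]

# Roughly what contains("use") selects (literal substring)
sel_contains <- cols[grepl("use", cols, fixed = TRUE)]

# Roughly what matches("^sd_|_total$") selects (regular expression)
sel_matches <- cols[grepl("^sd_|_total$", cols)]

print(list(sel_starts, sel_ends, sel_contains, sel_matches))
```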
Metric variables

    # One metric variable
    report_metrics(ds, sd_age)

    # Multiple metric items
    report_metrics(ds, starts_with("cg_adoption_"))

Cross tabulation and group comparison
Provide a grouping column in the third parameter to compare different groups.

    report_counts(ds, adopter, sd_gender)

For metric variables, you can compare the mean values.

    report_metrics(ds, sd_age, sd_gender)

By default, the crossing variable is treated as categorical. You can change this behavior using the metric parameter to calculate correlations:

    report_metrics(ds, sd_age, use_work, metric=TRUE)

The ci parameter, where possible, adds confidence intervals to the outputs.

    ds |>
      filter(sd_gender != "diverse") |>
      report_metrics(sd_age, sd_gender, ci=TRUE)

Conduct statistical tests with the effect parameter.

    ds |>
      filter(sd_gender != "diverse") |>
      report_counts(adopter, sd_gender, effect=TRUE)

See the function help (F1 key) to learn more options. For example, you can use the prop parameter to grow bars to 100%. The numbers parameter prints frequencies and percentages onto the bars.

    ds |>
      filter(sd_gender != "diverse") |>
      report_counts(adopter, sd_gender, prop="rows", numbers="n")

Theming
The theme_vlkr() function lets you customise colors:

    theme_set(
      theme_vlkr(
        base_fill=c("#F0983A", "#3ABEF0", "#95EF39", "#E35FF5", "#7A9B59"),
        base_gradient=c("#FAE2C4", "#F0983A")
      )
    )

Labeling
Labels used in plots and tables are stored in the comment attribute of the variable. You can inspect all labels using the codebook() function:

    codebook(ds)
    #> # A tibble: 97 × 6
    #>    item_name     item_group item_class item_label         value_name value_label
    #>    <chr>         <chr>      <chr>      <chr>              <chr>      <chr>
    #>  1 case          case       numeric    case               NA         NA
    #>  2 sd_age        sd         numeric    Age                NA         NA
    #>  3 cg_activities cg         character  Activities with C… NA         NA
    #>  4 cg_act_write  cg         character  cg_act_write       NA         NA
    #>  5 cg_act_test   cg         character  cg_act_test        NA         NA
    #>  6 cg_act_search cg         character  cg_act_search      NA         NA
    #>  7 adopter       adopter    factor     Innovator type     I try new… I try new …
    #>  8 adopter       adopter    factor     Innovator type     I try new… I try new …
    #>  9 adopter       adopter    factor     Innovator type     I wait un… I wait unt…
    #> 10 adopter       adopter    factor     Innovator type     I only us… I only use…
    #> # ℹ 87 more rows

Set specific column labels by providing a named list to the items parameter of labs_apply():

    ds %>%
      labs_apply(
        items=list(
          "cg_adoption_advantage_01"="General advantages",
          "cg_adoption_advantage_02"="Financial advantages",
          "cg_adoption_advantage_03"="Work-related advantages",
          "cg_adoption_advantage_04"="More fun"
        )
      ) %>%
      report_metrics(starts_with("cg_adoption_advantage_"))

Labels for values inside a column can be adjusted by providing a named list to the values parameter of labs_apply(). In addition, select the columns where value labels should be changed:

    ds %>%
      labs_apply(
        cols=starts_with("cg_adoption"),
        values=list(
          "1"="Strongly disagree",
          "2"="Disagree",
          "3"="Neutral",
          "4"="Agree",
          "5"="Strongly agree"
        )
      ) %>%
      report_metrics(starts_with("cg_adoption"))

To conveniently manage all labels of a dataset, save the result of codebook() to an Excel file, change the labels manually in a copy of the Excel file, and finally call labs_apply() with your revised codebook.

    library(readxl)
    library(writexl)

    # Save codebook to a file
    codes <- codebook(ds)
    write_xlsx(codes, "codebook.xlsx")

    # Load and apply a codebook from a file
    codes <- read_xlsx("codebook_revised.xlsx")
    ds <- labs_apply(ds, codes)

Be aware that some data operations, such as mutate() from the tidyverse, lose labels along the way. In this case, store the labels (in the codebook attribute of the data frame) before the operation and restore them afterwards:

    ds %>%
      labs_store() %>%
      mutate(sd_age=2024 - sd_age) %>%
      labs_restore() %>%
      report_metrics(sd_age)

The volker report template
Reports combine plots, tables and effect calculations in an R Markdown document. Optionally, for item batteries, an index, clusters or factors are calculated and reported.
To see an example or develop your own reports, use the volker report template in RStudio:
- Create a new R Markdown document from the main menu.
- In the popup, select the “From Template” option.
- Select the volker template.
- The template contains a working example. Just click knit to see the result.

Have fun developing your own reports!
Without the template, to generate a volker report from any R Markdown document, add volker::html_report to the output options of your Markdown document:

    ---
    title: "How to create reports?"
    output: volker::html_report
    ---

Then, you can generate combined outputs using the report functions. One advantage of the report functions is that plots are automatically scaled to fit the page. See the function help for further options (F1 key).

    ```{r echo=FALSE}
    ds %>%
      filter(sd_gender != "diverse") %>%
      report_counts(adopter, sd_gender)
    ```

Custom tab sheets
By default, a header and tabsheets are automatically created. You can mix in custom content.
- If you want to add content before the report outputs, set the title parameter to FALSE and add your own title.
- A good place for methodological details is a custom tabsheet next to the “Plot” and the “Table” buttons. You can add a tab by setting the close parameter to FALSE and adding a new header on the fifth level (5 x # followed by the tab name). Close your custom tabsheet with #### {-} (4 x #).

Try out the following pattern in an RMarkdown document!

    ### Adoption types

    ```{r echo=FALSE}
    ds %>%
      filter(sd_gender != "diverse") %>%
      report_counts(adopter, sd_gender, prop="rows", title=FALSE, close=FALSE)
    ```

    ##### Method
    Basis: Only male and female respondents.

    #### {-}

Index calculation for item batteries
For a quick inspection of an index built from a set of items, set the index parameter to TRUE. The index is calculated as the average value of all selected columns.
Cronbach’s Alpha and the number of items are calculated with psych::alpha() and stored as a column attribute named “psych.alpha”. The reliability values are printed by report_metrics().

    ds |>
      report_metrics(starts_with("cg_adoption"), index=TRUE)

You can add an index as a new column using add_index(). A new column is created with the average value of all selected columns for each case. Provide a custom name for the column using the newcol parameter. The report_metrics() function still outputs reliability values for the column.

Add a single index

    ds %>%
      add_index(starts_with("cg_adoption_"), newcol="idx_cg_adoption") %>%
      report_metrics(idx_cg_adoption)

Compare the index values by group

    ds %>%
      add_index(starts_with("cg_adoption_"), newcol="idx_cg_adoption") %>%
      report_metrics(idx_cg_adoption, adopter)

Add multiple indices and summarize them

    ds %>%
      add_index(starts_with("cg_adoption_")) %>%
      add_index(starts_with("cg_adoption_advantage")) %>%
      add_index(starts_with("cg_adoption_fearofuse")) %>%
      add_index(starts_with("cg_adoption_social")) %>%
      tab_metrics(starts_with("idx_cg_adoption"))

To reverse items, provide a selection of columns to the cols.reverse parameter of add_index().
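
To make the arithmetic behind an index tangible, here is a minimal base-R sketch on hypothetical toy data: it reverses one negatively worded item, averages the items per case, and computes Cronbach's alpha with the textbook variance formula. It is meant as an illustration of the idea only; the item names and the 1 to 5 scale are assumptions, and the package itself delegates the reliability calculation to psych::alpha().

```r
# Toy item battery: 5 respondents, 3 items on a 1-5 scale (hypothetical data)
items <- data.frame(
  item_01 = c(4, 5, 2, 3, 1),
  item_02 = c(5, 4, 1, 3, 2),
  item_03 = c(2, 1, 5, 3, 4)  # negatively worded item
)

# Reverse the negatively worded item: on a 1-5 scale, x becomes (1 + 5) - x
items$item_03 <- (1 + 5) - items$item_03

# Index: average of all items for each case
idx <- rowMeans(items)

# Cronbach's alpha from the item variances and the variance of the sum score
k <- ncol(items)
alpha <- (k / (k - 1)) * (1 - sum(sapply(items, var)) / var(rowSums(items)))

print(idx)
print(alpha)
```

Reversing before averaging matters: without it, the third item would pull the index in the opposite direction and depress alpha.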
Factor and cluster analysis
The easiest way to conduct factor or cluster analyses is to use the respective parameters in the report_metrics() function.

    ds |>
      report_metrics(starts_with("cg_adoption"), factors=TRUE, clusters=TRUE)

Currently, cluster analysis is performed using kmeans and factor analysis is a principal component analysis. Setting the parameters to TRUE automatically generates scree plots and selects the number of factors or clusters. Alternatively, you can explicitly specify the numbers.
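
The two techniques named above can be tried out in isolation with base R. The sketch below uses simulated, hypothetical items and stats::prcomp() and stats::kmeans(); these are stand-ins chosen for illustration, and the package's internal calls and its rule for picking the number of factors or clusters may differ.

```r
set.seed(42)

# Simulated item battery (hypothetical): two groups of 20 respondents, 3 items
x <- rbind(
  matrix(rnorm(60, mean = 2, sd = 0.5), ncol = 3),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 3)
)
colnames(x) <- c("item_01", "item_02", "item_03")

# Principal component analysis on standardised items
pca <- prcomp(x, center = TRUE, scale. = TRUE)
eigenvalues <- pca$sdev^2                      # the values a scree plot displays
explained <- eigenvalues / sum(eigenvalues)    # share of variance per component

# k-means: total within-cluster sum of squares for k = 2..5 (elbow diagnostics)
ks <- 2:5
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Final solution with two clusters; one cluster number per observation
fit <- kmeans(x, centers = 2, nstart = 10)

print(round(explained, 3))
print(round(wss, 2))
print(table(fit$cluster))
```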
Add factor or cluster analysis results to the original data
If you want to work with the results, use add_factors() and add_clusters() respectively. For factor analysis, new columns prefixed with “fct_” are created to store the factor loadings based on the specified number of factors. For clustering, an additional column prefixed with “cls_” is added that assigns each observation to a cluster number.

    ds |>
      add_factors(starts_with("cg_adoption"), k=3) |>
      select(starts_with("fct_"))

Once you have added factor or cluster columns to your data set, you can use them with the report functions:

    ds |>
      add_factors(starts_with("cg_adoption"), k=3) |>
      report_metrics(fct_cg_adoption_1, fct_cg_adoption_2, metric=TRUE)

    ds |>
      add_clusters(starts_with("cg_adoption"), k=3) |>
      report_counts(sd_gender, cls_cg_adoption, prop="cols")

After explicitly adding factor or cluster columns, you can inspect the analysis results using factor_tab(), factor_plot() or cluster_tab(), cluster_plot().

    ds |>
      add_factors(starts_with("cg_adoption"), k=3) |>
      factor_tab(starts_with("fct_"))

Automatically determine the number of factors or clusters
To automatically determine the optimal number of factors or clusters based on diagnostics, set k = NULL.

    ds |>
      add_factors(starts_with("cg_adoption"), k=NULL) |>
      factor_tab(starts_with("fct_cg_adoption"))

Modeling: Regression and Analysis of Variance
Modeling in the statistical sense is predicting an outcome (dependent variable) from one or multiple predictors (independent variables).
The report_metrics() function calculates a linear model if the model parameter is TRUE. You provide the variables in the following parameters:
- Dependent metric variable: first parameter after the dataset (cols parameter).
- Independent categorical variables: a tidy column selection in the second parameter (cross parameter).
- Independent metric variables: a tidy column selection in the third parameter (metric parameter).
- Interaction effects: interactions parameter with a vector of multiplication terms (e.g. interactions = c(sd_age * sd_gender)).

    ds |>
      filter(sd_gender != "diverse") |>
      report_metrics(
        use_work,
        cross=c(sd_gender, adopter),
        metric=sd_age,
        model=TRUE,
        diagnostics=TRUE
      )

Four selected diagnostic plots are generated if the diagnostics parameter is TRUE:
- Residuals vs. fitted: Residuals should be evenly distributed vertically. Horizontally, they should follow the straight line. Otherwise, this could indicate heteroscedasticity, non-linearity or autocorrelation.
- Scale-location plot: Points should be evenly distributed, without a pattern, as in the residual plot.
- Q-Q plot of the residuals: All dots should be located on a straight line. Otherwise, this may indicate non-linearity or that residuals are not normally distributed.
- Cook’s distance plot: High values indicate that single cases influence the model disproportionately. Rule of thumb: A Cook’s distance > 1 is a problem.
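
The quantities behind these four plots can also be inspected directly. The following sketch fits a plain stats::lm() model to simulated data and extracts them with base-R functions; the data and variable names are hypothetical, and this is only an assumption about what a linear model with diagnostics boils down to, not the package's own code.

```r
set.seed(1)

# Simulated data (hypothetical): one metric and one categorical predictor
n <- 100
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)
df$use <- 2 + 0.02 * df$age + 0.5 * (df$gender == "male") + rnorm(n, sd = 0.5)

# Linear model: metric outcome predicted by a metric and a categorical variable
fit <- lm(use ~ age + gender, data = df)

res     <- residuals(fit)        # residuals vs. fitted: plot against fitted(fit)
std_res <- rstandard(fit)        # Q-Q plot and scale-location use standardised residuals
cooks   <- cooks.distance(fit)   # influence of single cases

# Rule of thumb from the list above: flag cases with Cook's distance > 1
influential <- which(cooks > 1)

print(round(coef(fit), 3))
print(length(influential))

# The base-R counterparts of the four plots:
# plot(fit, which = 1)  # residuals vs. fitted
# plot(fit, which = 3)  # scale-location
# plot(fit, which = 2)  # normal Q-Q
# plot(fit, which = 4)  # Cook's distance
```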
To work with the predicted values, use add_model() instead of the report function. This will add a new variable prefixed with prd_ holding the predicted scores.

    ds <- ds |>
      add_model(use_work, categorical=c(sd_gender, adopter), metric=sd_age)

    report_metrics(ds, use_work, prd_use_work, metric=TRUE)

There are two functions to get the regression table or plot from the new column:

    model_tab(ds, prd_use_work)
    model_plot(ds, prd_use_work)

By default, p values are adjusted for the number of tests by controlling the false discovery rate (fdr). Set the adjust parameter to FALSE to disable the p-value correction.
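
What controlling the false discovery rate does to a set of p values can be seen with stats::p.adjust(). The sketch assumes that the adjustment corresponds to the Benjamini-Hochberg procedure (method "fdr" in base R); the raw p values are made up for illustration.

```r
# Hypothetical raw p values from five tests
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust)
p_fdr <- p.adjust(p_raw, method = "fdr")

print(data.frame(p_raw, p_fdr))

# Number of tests below the conventional 5% threshold before and after adjustment
n_sig_raw <- sum(p_raw < 0.05)
n_sig_fdr <- sum(p_fdr < 0.05)
print(c(raw = n_sig_raw, adjusted = n_sig_fdr))
```

In this toy example, four tests fall below 0.05 before the adjustment and only two afterwards, which is the kind of difference to expect when switching the correction off.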
Reliability scores (and classification performance indicators)
In content analysis, reliability is usually checked by having different persons code the same cases and then calculating the overlap. To calculate reliability scores, prepare one data frame for each person:
- All column names in the different data frames should be identical.
- Codings must be either binary (TRUE/FALSE) or contain a fixed number of values such as “sports”, “politics”, “weather”.
- Add a column holding initials or the name of the coder.
- One column must contain unique IDs for each case, e.g. a running case number.

Next, row-bind the data frames. The columns for coder and ID make sure that each coding is uniquely identified and can be related to the cases and coders.

    data_coded <- bind_rows(data_coder1, data_coder2)

The final data, for example, looks like:

| case | coder | topic_sports | topic_weather |
|---|---|---|---|
| 1 | anne | TRUE | FALSE |
| 2 | anne | TRUE | FALSE |
| 3 | anne | FALSE | TRUE |
| 1 | ben | TRUE | TRUE |
| 2 | ben | TRUE | FALSE |
| 3 | ben | FALSE | TRUE |
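
To illustrate what "calculating the overlap" means, the table above can be rebuilt in base R and the share of identically coded cases computed by hand. This is only a sketch of simple percent agreement on the example data; the reliability scores reported by the package may include further coefficients.

```r
# The example data from the table above
data_coded <- data.frame(
  case          = c(1, 2, 3, 1, 2, 3),
  coder         = c("anne", "anne", "anne", "ben", "ben", "ben"),
  topic_sports  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  topic_weather = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Put the two coders side by side, matched by case ID
anne <- data_coded[data_coded$coder == "anne", ]
ben  <- data_coded[data_coded$coder == "ben", ]
wide <- merge(anne, ben, by = "case", suffixes = c("_anne", "_ben"))

# Percent agreement: share of cases with identical codings
agree_sports  <- mean(wide$topic_sports_anne == wide$topic_sports_ben)
agree_weather <- mean(wide$topic_weather_anne == wide$topic_weather_ben)

print(c(sports = agree_sports, weather = agree_weather))
```

The coders agree on all three cases for the sports topic and on two of three cases for the weather topic (they disagree on case 1).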
Calculating reliability is straightforward with report_counts():
- Provide the data to the first parameter.
- Add the column with codings (or a selection of multiple columns, e.g. using starts_with()) to the second parameter.
- Set the third parameter to the column name with coder names or initials.
- Set the ids parameter to the column that contains case IDs or case numbers (this tells the volker package which cases belong together).
- Set the agree parameter to “reliability” to request reliability scores.

Example:

    report_counts(data_coded, starts_with("topic_"), coder, ids=case, prop="cols", agree="reliability")

Alternatively, if you are only interested in the scores, not a plot, you get them using agree_tab(). Hint: You may abbreviate the value “reliability”.

    agree_tab(data_coded, starts_with("topic_"), coder, ids=case, method="reli")

Further, you can request classification performance indicators (accuracy, precision, recall, F1) with the same function by setting the method to “classification” (may be abbreviated). Use this option if you compare manual codings to automated codings (classifiers, large language models). By default, you get macro statistics (average precision, recall and F1 over categories).
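
The macro statistics mentioned above can be reproduced by hand to see what they measure. The sketch below uses two hypothetical coding vectors, treats the manual codings as the reference, computes precision, recall and F1 per category, and averages them; it is an illustration of the definitions, not the package's implementation.

```r
# Hypothetical codings of eight cases: manual reference vs. automated classifier
manual    <- c("sports", "sports", "politics", "weather",
               "politics", "sports", "weather", "politics")
automated <- c("sports", "politics", "politics", "weather",
               "politics", "sports", "sports", "politics")

categories <- sort(unique(manual))

# Precision, recall and F1 for each category (one row per category)
per_category <- t(sapply(categories, function(category) {
  tp <- sum(automated == category & manual == category)  # true positives
  fp <- sum(automated == category & manual != category)  # false positives
  fn <- sum(automated != category & manual == category)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}))

# Accuracy: share of identically coded cases
accuracy <- mean(manual == automated)

# Macro statistics: unweighted average over categories
macro <- colMeans(per_category)

print(round(per_category, 3))
print(accuracy)
print(round(macro, 3))
```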
If a column contains multiple values, you can focus on one category to get micro statistics:

    agree_tab(data_coded, starts_with("topic_"), coder, ids=case, method="class", category="catcontent")

The mystery of missing values
Cases with missing values, by default, are omitted in all methods. Thus, the calculations are only based on cases with complete values in the selected columns.
Furthermore, each function first cleans the values:
- Residual levels defined in the VLKR_NA_LEVELS constant are recoded to missing values (“[NA] nicht beantwortet”, “[NA] keine Angabe”, “[no answer]” and “keine Angabe”).
- Residual numeric values defined in the VLKR_NA_NUMBERS constant are recoded to missing values (-9, -2, and -1).

    print(volker:::VLKR_NA_LEVELS)
    #> [1] "[NA] nicht beantwortet" "[NA] keine Angabe"      "[no answer]"
    #> [4] "keine Angabe"
    print(volker:::VLKR_NA_NUMBERS)
    #> [1] -9 -2 -1

The output always contains information about how many cases were removed due to missing values. You have three options to treat missings:
- Disable recoding by setting the clean parameter of the functions to FALSE.
- Override the values in VLKR_NA_LEVELS and VLKR_NA_NUMBERS by calls such as options(vlkr.na.levels=c("Not answered")) or options(vlkr.na.numbers=c(-2,-9)). If you set the value to FALSE, no values are recoded.
- When analysing items, use pairwise complete data by calling options(vlkr.na.omit=FALSE) (maximal information from all items).
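
The difference between omitting incomplete cases and using all available values per item is easy to see on a small example. The following base-R sketch uses hypothetical toy data, copies the residual values printed above into local variables, recodes them to NA, and compares both strategies; it approximates the described behaviour and is not the package's cleaning code.

```r
# Residual values as printed above (copied here so the sketch is self-contained)
na_levels  <- c("[NA] nicht beantwortet", "[NA] keine Angabe",
                "[no answer]", "keine Angabe")
na_numbers <- c(-9, -2, -1)

# Toy data (hypothetical)
df <- data.frame(
  item_01 = c(4, -9, 3, 5, 2),
  item_02 = c(5, 4, -1, 5, 1),
  gender  = c("female", "male", "[no answer]", "female", "male"),
  stringsAsFactors = FALSE
)

# Recode residual values to missing values
df$item_01[df$item_01 %in% na_numbers] <- NA
df$item_02[df$item_02 %in% na_numbers] <- NA
df$gender[df$gender %in% na_levels] <- NA

# Default strategy: keep only cases that are complete in the selected columns
item_cols <- c("item_01", "item_02")
complete  <- df[complete.cases(df[, item_cols]), ]
n_removed <- nrow(df) - nrow(complete)
listwise_means <- colMeans(complete[, item_cols])

# Alternative: use every available value of each item
pairwise_means <- colMeans(df[, item_cols], na.rm = TRUE)

print(n_removed)
print(listwise_means)
print(pairwise_means)
```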
What’s behind the scenes?
The volker package is based on standard methods for data handling and visualisation. You could produce all outputs on your own. The package just keeps your code DRY (don’t repeat yourself) and wraps frequently used snippets into a simple interface.
Report functions call subsidiary tab, plot and effect functions, which in turn call functions specifically designed for the provided column selection. Open the package help to see which specific functions the report functions are redirected to.
Console and markdown output is polished by specific print and knit functions. To make this work, the cleaned data, produced plots, tables and markdown snippets gain new classes (vlkr_df, vlkr_plt, vlkr_tbl, vlkr_list, vlkr_rprt).
The volker package makes use of common tidyverse functions. Basically, most outputs are generated by three functions:
- count() is used to produce counts.
- skim() is used to produce metrics.
- ggplot() is used to assemble plots.

Statistical tests, clustering and factor analysis are largely based on the stats, psych, car and effectsize packages.
Thanks to all the maintainers, authors and contributors of the packages that make the world of data a magical place.
