If R makes complex things simple, it can sometimes make simple things difficult. This is why tabxplor tries to make it easy to deal with multiple cross-tables: to create and manipulate them, but also to read them, using color helpers to highlight important information (differences from totals, comparisons between lines or columns, contributions to variance, margins of error, etc.). It would love to enhance your data exploration experience with simple yet powerful tools. All functions are powered by the tidyverse, are pipe-friendly, and return tibble data frames which can be easily manipulated with dplyr. At the same time, time-consuming operations are done with data.table to go faster with big data frames. Tables can be exported to Excel and to html with formats and colors.
The main functions are made to be user-friendly and time-saving in data analysis workflows.
tab makes a simple cross-table:
library(tabxplor)
tab(forcats::gss_cat, marital, race)
#> # A tabxplor tab: 7 × 5
#>   marital       Other Black  White  Total
#>   <fct>           <n>   <n>    <n>    <n>
#> 1 No answer         2     2     13     17
#> 2 Never married   633 1 305  3 478  5 416
#> 3 Separated       110   196    437    743
#> 4 Divorced        212   495  2 676  3 383
#> 5 Widowed          70   262  1 475  1 807
#> 6 Married         932   869  8 316 10 117
#> 7 Total         1 959 3 129 16 395 21 483
When one of the row or column variables is numeric, tab calculates means by category of the other variable.
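As a minimal sketch of the numeric case, the call below uses the age column of forcats::gss_cat (an integer variable) as the column variable. The object name mean_age is only illustrative.

```r
library(tabxplor)

# age is numeric, so each row shows the mean age for that marital status
# instead of a count per category.
mean_age <- tab(forcats::gss_cat, marital, age)
mean_age
```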
tab comes with options to weight the table, print percentages, manage totals, digits and missing values, add legends, and gather rare categories in an “Others” level.
When a third variable is provided, tab makes a table with as many subtables as it has levels. With several tab_vars, it makes a subtable for each combination of their levels. The result is grouped: in dplyr, operations like sum() or all() are done within each subtable, and not for the whole data frame.
Colors may be added to highlight over-represented and under-represented cells, and therefore help the user read the table. By default, with color = "diff", colors are based on the differences between a cell and its related total (which only works with means and row or col pct). When a percentage is higher than the average percentage of the line or column, it appears with shades of green (or blue). When it is lower, it appears with shades of red/orange. A color legend is added below the table. In RStudio, colors are adapted to the theme, light or dark.
data <- forcats::gss_cat |>
  dplyr::filter(year %in% c(2000, 2006, 2012), !marital %in% c("No answer", "Widowed"))
gss  <- "Source: General social survey 2000-2014"
gss2 <- "Source: General social survey 2000, 2006 and 2012"
tab(data, race, marital, year, subtext = gss2, pct = "row", color = "diff")
The sup_cols argument adds supplementary column variables to the table. With numeric variables, it calculates the mean for each category of the row variable. With text variables, only the first level is kept (you can choose which one to use by placing it first with forcats::fct_relevel). Use tab_many to keep all levels.
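A short sketch of sup_cols with two numeric supplementary columns. The choice of age and tvhours from forcats::gss_cat is an assumption made for illustration; the variables are passed as a character vector.

```r
library(tabxplor)

# Cross race with marital status, then append the mean age and the mean
# number of TV hours of each race category as supplementary columns.
with_sup <- tab(forcats::gss_cat, race, marital,
                sup_cols = c("age", "tvhours"))
with_sup
```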
By default, to calculate colors, each cell is compared to the subtable’s related total.
When a third variable or more is provided, it’s possible to compare with the general total line instead, by setting comp = "all". Here, only the last total row is highlighted (TOTAL ENSEMBLE appears in white but other total rows in grey).
With ref = "first", each row (or column) is compared to the first row (or column), which is particularly helpful to highlight historical evolutions. The first row then appears in white (while row totals are themselves colored like normal lines).
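A sketch of both comparison modes on the same filtered survey data. Converting year to a factor is an assumption made here so that it can serve as a row variable rather than being averaged; the object names are illustrative.

```r
library(tabxplor)

data <- forcats::gss_cat |>
  dplyr::filter(year %in% c(2000, 2006, 2012),
                !marital %in% c("No answer", "Widowed"))

# Compare every cell with the general total row rather than with the
# total of its own subtable.
by_all <- tab(data, race, marital, year,
              pct = "row", color = "diff", comp = "all")

# Compare each year with the first one. year is turned into a factor so
# that it is treated as categories.
data_fct <- dplyr::mutate(data, year = as.factor(year))
by_first <- tab(data_fct, year, marital, race,
                pct = "row", color = "diff", ref = "first")

by_all
by_first
```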
It is possible to print confidence intervals for each cell:
tab(forcats::gss_cat, race, marital, pct = "row", ci = "cell")
#> # A tabxplor tab: 4 × 9
#>   race  `No answer` `Never married` Separated Divorced Widowed  Married  Total
#>   <fct>      <row%>          <row%>    <row%>   <row%>  <row%>   <row%> <row%>
#> 1 Other          0%        [30;34]%    [5;7]%  [9;12]%  [3;4]% [45;50]%   100%
#> 2 Black          0%        [40;43]%    [5;7]% [14;17]%  [7;9]% [26;29]%   100%
#> 3 White          0%        [21;22]%    [2;3]% [16;17]%      9% [50;51]%   100%
#> 4 Total          0%             25%        3%      16%      8%      47%   100%
#> # ℹ 1 more variable: n <n>
It is also possible to use confidence intervals to enhance color helpers. With color = "diff_ci", a cell is colored only if the difference between it and its reference cell (in the total or first row/col) is larger than the confidence interval of that difference. Otherwise, the cell is not significantly different from its reference in the total (or first) row: it turns grey, and the reader is no longer tempted to over-interpret the difference.
Finally, another calculation appears helpful: the difference between the cell and the total, minus the confidence interval of this difference (or, in other words, what remains of the difference after having subtracted the confidence interval). color = "after_ci" highlights all the cells whose value is significantly different from the relative total (or first cell). This is particularly useful when working on small samples: we can see at a glance which numbers we have the right to read and interpret.
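The rule can be checked by hand on one cell. The sketch below uses base R only and the counts of the marital by race table shown earlier. It assumes a normal-approximation interval for the difference of two proportions treated as independent; the formula tabxplor actually uses may differ, and all variable names are illustrative.

```r
# Counts taken from the marital x race table shown earlier.
n_cell <- 1305    # Black respondents who never married
n_row  <- 3129    # all Black respondents
n_tot  <- 5416    # all respondents who never married
n_all  <- 21483   # all respondents

p_cell <- n_cell / n_row   # row percentage of the cell
p_tot  <- n_tot / n_all    # row percentage in the total row
delta  <- p_cell - p_tot   # difference from the total

# Assumption: normal-approximation margin of error of the difference,
# at the 95% level.
z       <- qnorm(0.975)
ci_diff <- z * sqrt(p_cell * (1 - p_cell) / n_row +
                    p_tot  * (1 - p_tot)  / n_all)

significant <- abs(delta) > ci_diff   # TRUE: the cell keeps its color
after_ci    <- abs(delta) - ci_diff   # what remains of the difference

significant
after_ci
```

When significant is FALSE the difference is smaller than its own margin of error, which is the case where the cell would turn grey.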
chi2 = TRUE adds summary statistics made in the chi2 metric: degrees of freedom (df), unweighted count, pvalue and (sub)table’s variance. The chi2 pvalue is colored in green when below 5%, and in red when greater than or equal to 5%, meaning that the table is not significantly different from the independence hypothesis (the two variables may be independent).
Chi2 stats can also be used to color cells based on their contributions to the variance of the (sub)table, with color = "contrib". By default, only the cells whose contribution is above the mean contribution are colored. It highlights the cells which would stand out in a correspondence analysis (the two related categories would be located at the edges of the first axes; here, being black is associated with never married and being separated).
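These statistics can be reproduced by hand in base R from the counts of the marital by race table shown earlier. This is only a sketch under two assumptions: that the "variance" is the chi2 statistic divided by the number of observations, and that a contribution is a cell's share of the chi2 statistic. tabxplor's internal computation may differ in its details.

```r
# Observed counts (total row and total column left out).
counts <- matrix(
  c(  2,    2,   13,
    633, 1305, 3478,
    110,  196,  437,
    212,  495, 2676,
     70,  262, 1475,
    932,  869, 8316),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    marital = c("No answer", "Never married", "Separated",
                "Divorced", "Widowed", "Married"),
    race    = c("Other", "Black", "White")
  )
)

n         <- sum(counts)
expected  <- outer(rowSums(counts), colSums(counts)) / n  # independence
cell_chi2 <- (counts - expected)^2 / expected

chi2     <- sum(cell_chi2)
df       <- (nrow(counts) - 1) * (ncol(counts) - 1)
pvalue   <- pchisq(chi2, df, lower.tail = FALSE)
variance <- chi2 / n            # assumption: variance = chi2 / n

ctr        <- cell_chi2 / chi2  # contribution of each cell
above_mean <- ctr > mean(ctr)   # cells that would be colored by default

above_mean
```

In this table the cells for Black respondents who never married or are separated both come out above the mean contribution, in line with the association noted above.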
The result of tab is a tibble::tibble data frame with class tab. It gets its own printing methods but, at the same time, can be transformed using most dplyr verbs, like a normal tibble.
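library(dplyr)
storm_tab <- tab(storms, category, status, sup_cols = c("pressure", "wind")) |>
  filter(category != "-1") |>
  select(-`tropical depression`)
# use is_totrow to keep total rows order
# (the native pipe has no `.` placeholder, hence the intermediate object)
arrange(storm_tab, is_totrow(storm_tab), desc(category))
With dplyr::arrange, don’t forget to keep the order of tab variables and total rows, as the is_totrow() sort key does above.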
tab is a wrapper around the more powerful function tab_many, which can be used to customize your tables.
It’s possible, for example, to make a summary table of as many column variables as you want (showing all levels, or showing only one specific level like here):
first_lvs <- c("Married", "$25000 or more", "Strong republican", "Protestant")
data <- forcats::gss_cat |>
  mutate(across(where(is.factor), ~ forcats::fct_relevel(., first_lvs[first_lvs %in% levels(.)])))
tab_many(data, race, c(marital, rincome, partyid, relig, age, tvhours),
         levels = "first", pct = "row", chi2 = TRUE, color = "auto")
Using tab or tab_many with purrr::map and tibble::tribble, you can program several tables with different parameters all at once, in a readable way:
To export a table to html with colors, tabxplor uses knitr::kable and kableExtra. In this format, differences from totals, confidence intervals, contributions to variance, and unweighted counts are available in a tooltip on cell hover.
To print an html table by default (for example, in RStudio viewer), use tabxplor options:
options(tabxplor.print = "kable") # default to options(tabxplor.print = "console")
tab_xl exports any table or list of tables to Excel, with all colors, chi2 stats and formatting. In Excel, it is still possible to do calculations on raw numbers (the display is rounded but, underneath, decimals are kept).
tabs |> tab_xl(replace = TRUE, sheets = "unique")
tab_plot exports any table as a plot image.
tabs |> tab_plot()
Programming with tabxplor
When not doing data analysis but writing functions, you can use the sub-functions of tab_many step by step to attain more flexibility or speed. That way, it’s possible to write new functions to customize your tables even more.
The whole architecture of tabxplor is powered by a special vector class, named tabxplor_fmt for formatted numbers. As a vctrs::record, it stores behind the scenes all the data necessary to calculate printed results, formats and colors. A set of functions is available to access or transform this data. See ?fmt for more information.
The simple way to recover the underlying numbers as numeric vectors is get_num, which extracts the currently displayed field, whatever it is:
tabs <- tab(forcats::gss_cat, race, marital, pct = "row")
tabs |> dplyr::mutate(across(where(is_fmt), get_num))
#> # A tabxplor tab: 4 × 9
#>   race  `No answer` `Never married` Separated Divorced Widowed Married Total
#>   <fct>       <dbl>           <dbl>     <dbl>    <dbl>   <dbl>   <dbl> <dbl>
#> 1 Other    0.00102            0.323    0.0562    0.108  0.0357   0.476     1
#> 2 Black    0.000639           0.417    0.0626    0.158  0.0837   0.278     1
#> 3 White    0.000793           0.212    0.0267    0.163  0.0900   0.507     1
#> 4 Total    0.000791           0.252    0.0346    0.157  0.0841   0.471     1
#> # ℹ 1 more variable: n <dbl>
To render character vectors (without colors), use format:
tabs |> mutate(across(where(is_fmt), format))
The following fields compose any fmt column (though many can be NA if not calculated):
display : name of the field to display, customisable for each cell (character)
n : raw count (integer)
wn : weighted count
pct : percentages
diff : differences from totals or reference cells
digits : digits to display, customisable for each cell (integer)
ctr : contributions of cells to variance (with color = "contrib")
rr : relative risks, needed to calculate odds ratio
or : odds ratios (or relative risks ratios)
in_totrow : TRUE if the cell is part of a total row, FALSE otherwise (logical)
in_tottab : TRUE if the cell is part of a total table, FALSE otherwise (logical)
in_refrow : TRUE if the cell is part of a reference row, FALSE otherwise (logical)
vctrs::vec_data(tabs$Married)
#>       n display digits wn       pct      mean         diff ctr var ci rr or
#> 1   932     pct      0 NA 0.4757529 1.0102402  0.004822432  NA  NA NA NA NA
#> 2   869     pct      0 NA 0.2777245 0.5897357 -0.193205991  NA  NA NA NA NA
#> 3  8316     pct      0 NA 0.5072278 1.0770757  0.036297310  NA  NA NA NA NA
#> 4 10117     pct      0 NA 0.4709305 1.0000000  0.000000000  NA  NA NA NA NA
#>   in_totrow in_tottab in_refrow
#> 1     FALSE     FALSE     FALSE
#> 2     FALSE     FALSE     FALSE
#> 3     FALSE     FALSE     FALSE
#> 4      TRUE     FALSE     FALSE
To get those underlying fields you can either use vctrs::fields or, more simply, $:
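A minimal sketch of both access styles on the Married column of a row-percentage table. vctrs::field() is used here to pull a single field by name; the object name tabs is illustrative.

```r
library(tabxplor)

tabs <- tab(forcats::gss_cat, race, marital, pct = "row")

vctrs::fields(tabs$Married)        # names of all fields of the column
vctrs::field(tabs$Married, "pct")  # one field, extracted with vctrs
tabs$Married$pct                   # the same field, extracted with $
```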
Faster to write and easier to read, you can also use dplyr::mutate() on a fmt vector. For example, to create a new column with standard deviations and display it with decimals:
tab_num(data, race, c(age, tvhours), marital, digits = 1L, comp = "all") |>
  dplyr::mutate(dplyr::across( # Mutate over the whole table.
    c(age, tvhours),
    ~ dplyr::mutate(.,         # Mutate over each fmt vector's underlying data.frame.
                    var     = sqrt(var),
                    display = "var",
                    digits  = 2L) |>
      set_color("no"),
    .names = "{.col}_sd"
  ))
Some helper functions exist for total rows, total tables and reference rows (is_totrow() / as_totrow(), is_tottab() / as_tottab(), is_refrow() / as_refrow()):
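A short sketch of the total-row helper on a simple table; the object names are illustrative.

```r
library(tabxplor)

tabs <- tab(forcats::gss_cat, race, marital, pct = "row")

# One logical per row of the table, TRUE for total rows.
is_totrow(tabs)

# Keep only the rows that are not totals.
body_rows <- dplyr::filter(tabs, !is_totrow(tabs))
body_rows
```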
Each fmt column has attributes, which you can access or modify with get_ and set_ functions:
type / get_type() / set_type() : the type of the fmt vector, among c("n", "mean", "row", "col", "all", "all_tabs"); it determines which calculations are done within tab_ functions.
totcol / is_totcol() / as_totcol() : TRUE if the column is a total column, FALSE otherwise (logical)
refcol / is_refcol() / as_refcol() : TRUE if the column is a reference column for comparison, FALSE otherwise (logical)
color / get_color() / set_color() : the calculation to make to print colors; among c("", "no", "diff", "diff_ci", "after_ci", "contrib")
col_var / get_col_var() / set_col_var() : the name of the column variable (there can be many in one single table)
comp_all / get_comp_all() / set_comp_all() : when there are tab_vars, is the reference for comparison the subtable (FALSE), or the total table (TRUE)?
ref / get_ref_type() / set_diff_type() : the type of difference calculated, either "no", "tot" for totals, an index, or a regular expression.
ci_type / get_ci_type() / set_ci_type() : the type of confidence interval, either "cell" or "diff"
For example, to print the number of observations of the total column:
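One possible way to write this, sketched by combining is_totcol() with dplyr::mutate() on the fmt vector. The is_fmt() guard is an assumption added here so that non-fmt columns (such as the row variable) are skipped safely; the object names are illustrative.

```r
library(tabxplor)

tabs <- tab(forcats::gss_cat, race, marital, pct = "row")

# Switch the displayed field of the total column from "pct" to "n".
tabs_n <- tabs |>
  dplyr::mutate(dplyr::across(
    where(~ is_fmt(.x) && is_totcol(.x)),
    ~ dplyr::mutate(.x, display = "n")
  ))
tabs_n
```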
Note that, if tab_vars are provided, the table is grouped and all operations are made within groups. To remove grouping (for example when it gives errors), use dplyr::ungroup().
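A small sketch of the grouping behaviour, using year as a tab variable; dplyr::group_vars() is used only to inspect the grouping, and the object names are illustrative.

```r
library(tabxplor)

data <- forcats::gss_cat |>
  dplyr::filter(year %in% c(2000, 2006, 2012))

grouped <- tab(data, race, marital, year, pct = "row")
dplyr::group_vars(grouped)   # the tab variable(s) used for grouping

flat <- dplyr::ungroup(grouped)
dplyr::group_vars(flat)      # no grouping variable left
```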
If you only need the simplest table, with only numeric counts (no fmt), or even a base data.frame (not a tibble):
tab_plain(data, race, marital, num = TRUE) # counts as a numeric vector
tab_plain(data, race, marital, df = TRUE)  # same, with unique class = "data.frame"
About
User-Friendly Tables with Color Helpers for Data Exploration