Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

User-Friendly Tables with Color Helpers for Data Exploration

License

NotificationsYou must be signed in to change notification settings

BriceNocenti/tabxplor

Repository files navigation

R-CMD-check

If R makes complex things simple, it can sometimes make simple thingsdifficult. This is whytabxplor tries to make it easy to deal withmultiple cross-tables: to create and manipulate them, but also to readthem, using color helpers to highlight important informations(differences from totals, comparisons between lines or columns,contributions to variance, margins of error, etc.). It would love toenhance your data exploration experience with simple yet powerful tools.All functions are propelled bytidyverse, pipe-friendly, and rendertibble data frames which can be easily manipulated withdplyr. Inthe same time, time-taking operations are done withdata.table to gofaster with big dataframes. Tables can be exported to Excel and in htmlwith formats and colors.

Installation

You can install tabxplor from CRAN with:

install.packages("tabxplor")

Or from github with :

# install.packages(devtools)devtools::install_github("BriceNocenti/tabxplor")

Base usage: cross-tables with color helpers

The main functions are made to be user-friendly and time-saving in dataanalysis workflows.

tab makes a simple cross-table:

library(tabxplor)tab(forcats::gss_cat,marital,race)#> # A tabxplor tab: 7 × 5#>   marital       Other Black  White  Total#>   <fct>           <n>   <n>    <n>    <n>#> 1 No answer         2     2     13     17#> 2 Never married   633 1 305  3 478  5 416#> 3 Separated       110   196    437    743#> 4 Divorced        212   495  2 676  3 383#> 5 Widowed          70   262  1 475  1 807#> 6 Married         932   869  8 316 10 117#> 7 Total         1 959 3 129 16 395 21 483

When one of the row or column variables is numeric,tab calculatesmeans by category of the other variable.

tab comes with options to weight the table, print percentages, managetotals, digits and missing values, add legends, gather rare categoriesin a “Others” level.

tab(forcats::gss_cat,marital,race,pct="row",na="drop",other_if_less_than=1000,other_level="Custom_other_level_name")

When a third variable is provided,tab makes a table with as manysubtables as it has levels. With severaltab_vars, it makes a subtablefor each combination of their levels. The result is grouped: in dplyr,operations likesum() orall() are done within each subtable, andnot for the whole dataframe.

Colors may be added to highlight over-represented and under-representedcells, and therefore help the user read the table. By default, withcolor = "diff", colors are based on the differences between a cell andit’s related total (which only works with means and row or col pct).When a percentage is superior to the average percentage of the line orcolumn, it appears with shades of green (or blue). When it’s inferior,it appears with shades of red/orange. A color legend is added below thetable. In RStudio colors are adapted to the theme, light or dark.

data<-forcats::gss_cat|>dplyr::filter(year%in% c(2000,2006,2012),!marital%in% c("No answer","Widowed"))gss<-"Source: General social survey 2000-2014"gss2<-"Source: General social survey 2000, 2006 and 2012"tab(data,race,marital,year,subtext=gss2,pct="row",color="diff")

Thesup_cols argument adds supplementary column variables to thetable. With numeric variables, it calculates the mean for each categoryor the row variable. With text variables, only the first level is kept(you can choose which one to use by placing it first withforcats::fct_relevel). Usetab_many to keep all levels.

tab(dplyr::storms,category,status,sup_cols= c("pressure","wind"))#> # A tabxplor tab: 7 × 13#>   category disturbance extratropical hurricane `other low`#>   <fct>            <n>           <n>       <n>         <n>#> 1 1                  0             0     2 548           0#> 2 2                  0             0       993           0#> 3 3                  0             0       593           0#> 4 4                  0             0       553           0#> 5 5                  0             0       116           0#> 6 NA               171         2 151         0       1 453#> 7 Total            171         2 151     4 803       1 453#> # ℹ 8 more variables: `subtropical depression` <n>, `subtropical storm` <n>,#> #   `tropical depression` <n>, `tropical storm` <n>, `tropical wave` <n>,#> #   Total <n>, pressure <mean>, wind <mean>

References and comparison levels for colors

By default, to calculate colors, each cell is compared to the subtable’srelated total.

When a third variable or more are provided, it’s possible to comparewith the general total line instead, by settingcomp = "all". Here,only the last total row is highlighted (TOTAL ENSEMBLE appears in whitebut other total rows in grey).

tab(data,race,marital,year,subtext=gss2,pct="row",color="diff",comp="all")

Withref = "first", each row (or column) is compared to the first row(or column), which is particularly helpful to highlight historicalevolutions. The first rows then appears in white (while rows totals arethemselves colored like normal lines).

data<-data|>dplyr::mutate(year= as.factor(year))tab(data,year,marital,race,pct="row",color="diff",ref="first",tot="col",totaltab="table")

Whenref is a number, the nth row (or column) is used for comparison.

tab(data,year,marital,race,pct="row",color="diff",ref=3)

Finally, whenref is a string, it it used as a regular expression, tomatch with the names of the rows (or columns).

tab(data,year,marital,race,pct="col",tot="row",color="diff",ref="Married")

Confidence intervals

It it possible to print confidence intervals for each cell:

tab(forcats::gss_cat,race,marital,pct="row",ci="cell")#> # A tabxplor tab: 4 × 9#>   race   `No answer` `Never married` Separated Divorced Widowed  Married  Total#>   <fct>       <row%>          <row%>    <row%>   <row%>  <row%>   <row%> <row%>#> 1 Other           0%        [30;34]%    [5;7]%  [9;12]%  [3;4]% [45;50]%   100%#> 2 Black           0%        [40;43]%    [5;7]% [14;17]%  [7;9]% [26;29]%   100%#> 3 White           0%        [21;22]%    [2;3]% [16;17]%      9% [50;51]%   100%#> 4 Total           0%             25%        3%      16%      8%      47%   100%#> # ℹ 1 more variable: n <n>

It is also possible to use confidence intervals to enhance colorshelpers. Withcolor = "diff_ci", the cells are only colored if theconfidence interval of the difference between them and their referencecell (in total or first row/col) is superior to the difference itself.Otherwise, it means the cell is not significantly different from it’sreference in the total (or first) row: it turns grey, and the reader isnot anymore tempted to over-interpret the difference.

tab(forcats::gss_cat,race,marital,pct="row",color="diff_ci")

Finally, another calculation appears helpful: the difference between thecell and the total, minus the confidence interval of this difference (orin other word, what remains of the difference after having subtractedthe confidence interval).ci = "after_ci" highligths all the cellswhose value is significantly different from the relative total (or firstcell). This is particularly useful when working on small samples : wecan see at a glance which numbers we have right to read and interpret.

tab(forcats::gss_cat,race,marital,subtext=gss,pct="row",color="after_ci")

Chi2 stats and contributions of cells to variance

chi2 = TRUE add summary statistics made in the chi2 metric: degrees offreedom (df), unweighted count, pvalue and (sub)table’s variance. Chi2pvalue is colored in green when inferior to 5%, and in red when superioror equal to 5%, meaning that the table is not significantly differentfrom the independent hypothesis (the two variables may be independent).

tab(forcats::gss_cat,race,marital,chi2=TRUE)#> # A tabxplor tab: 5 × 8#>   race   `No answer` `Never married` Separated Divorced Widowed Married  Total#>   <char>   <n-mixed>             <n>       <n>      <n>     <n>     <n>    <n>#> 1 Other            2             633       110      212      70     932  1 959#> 2 Black            2           1 305       196      495     262     869  3 129#> 3 White           13           3 478       437    2 676   1 475   8 316 16 395#> 4 Total           17           5 416       743    3 383   1 807  10 117 21 483#> 5 pvalue      <0.01%

Chi2 stats can also be used to color cells based on their contributionsto the variance of the (sub)table, withcolor = "contrib". By default,only the cells whose contribution is superior to the mean contributionare colored. It highlights the cells which would stand out in acorrespondence analysis (the two related categories would be located atthe edges of the first axes ; here, being black is associated with nevermarried and being separated).

tab(forcats::gss_cat,race,marital,color="contrib")

Combinetabxplor anddplyr

The result oftab is atibble::tibble data frame with classtab.It gets it’s own printing methods but, in the same time, can betransformed using mostdplyr verbs, like a normaltibble.

library(dplyr)tab(storms,category,status,sup_cols= c("pressure","wind"))|>filter(category!="-1")|>dplyr::select(-`tropical depression`)arrange(is_totrow(.), desc(category))# use is_totrow to keep total rows order

Withdplyr::arrange, don’t forget to keep the order of tab variablesand total rows:

tab(data,race,marital,year,pct="row")|>arrange(year, is_totrow(.), desc(Married))

Draw more complex tables withtab_many

tab is a wrapper around the more powerful functiontab_many, whichcan be used to customize your tables.

It’s possible, for example, to make a summary table of as many columnsvariables as you want (showing all levels, or showing only one specificlevel like here):

first_lvs<- c("Married","$25000 or more","Strong republican","Protestant")data<-forcats::gss_cat|> mutate(across(where(is.factor),~forcats::fct_relevel(.,first_lvs[first_lvs%in% levels(.)])))tab_many(data,race, c(marital,rincome,partyid,relig,age,tvhours),levels="first",pct="row",chi2=TRUE,color="auto")

Usingtab ortab_many withpurrr::map andtibble::tribble, youcan program several tables with different parameters all at once, in areadable way:

tabs<-purrr::pmap(tibble::tribble(~row_var,~col_vars       ,~pct ,~filter              ,~subtext               ,"race"  ,"marital"       ,"no" ,NULL                 ,"Source: GSS 2000-2014","race"  ,"marital"       ,"row",NULL                 ,"Source: GSS 2000-2014","race"  ,"marital"       ,"col",NULL                 ,"Source: GSS 2000-2014","relig" , c("race","age"),"row","year %in% 2000:2010","Source: GSS 2000-2010","relig" , c("race","age"),"row","year %in% 2010:2014","Source: GSS 2010-2014",NA_character_,"race"     ,"no" ,NULL                 ,"Source: GSS 2000-2014",),.f=tab_many,data=forcats::gss_cat,color="auto",chi2=TRUE)

Export to html, Excel or plot

To export a table to html with colors, tabxplor usesknitr::kable andkableExtra. In this format differences from totals, confidenceintervals, contribution to variance, and unweighted counts, areavailable in a tooltip at cells hover.

tabs<- tab(forcats::gss_cat,race,marital,subtext="Source: GSS 2000-2014",pct="row",color="diff")tabs|> tab_kable()

To print an html table by default (for example, in RStudio viewer), usetabxplor options:

options(tabxplor.print="kable")# default to options(tabxplor.print = "console")

tab_xl exports any table or list of tables to Excel, with all colors,chi2 stats and formatting. On Excel, it is still possible to docalculations on raw numbers (display is rounded but, below, decimals arekept).

tabs|> tab_xl(replace=TRUE,sheets="unique")

tab_plot exports any table as a plot image.

tabs|> tab_plot()

Programming withtabxplor

When not doing data analysis but writing functions, you can use thesub-functions oftab_many step by step to attain more flexibility orspeed. That way, it’s possible to write new functions to customize yourtables even more.

data<-dplyr::starwars|>tab_prepare(sex,hair_color,gender,other_if_less_than=5,na_drop_all=sex)data|>tab_plain(sex,hair_color,gender,tot= c("row","col"),pct="row",comp="all")|>tab_ci("diff",color="after_ci")|>tab_chi2(calc="p")

The whole architecture oftabxplor is powered by a special vectorclass, namedtabxplor_fmt for formatted numbers. As avctrs::record,it stores behind the scenes all the data necessary to calculate printedresults, formats and colors. A set of functions are available to accessor transform this data.?fmt to get more information.

The simple way to recover the underlying numbers as numeric vectors isget_num, which extract the currently displayed field whatever it is :

tabs<- tab(forcats::gss_cat,race,marital,pct="row")tabs|>dplyr::mutate(across(where(is_fmt),get_num))#> # A tabxplor tab: 4 × 9#>   race   `No answer` `Never married` Separated Divorced Widowed Married Total#>   <fct>        <dbl>           <dbl>     <dbl>    <dbl>   <dbl>   <dbl> <dbl>#> 1 Other     0.00102            0.323    0.0562    0.108  0.0357   0.476     1#> 2 Black     0.000639           0.417    0.0626    0.158  0.0837   0.278     1#> 3 White     0.000793           0.212    0.0267    0.163  0.0900   0.507     1#> 4 Total     0.000791           0.252    0.0346    0.157  0.0841   0.471     1#> # ℹ 1 more variable: n <dbl>

To render character vectors (without colors), useformat:

tabs|> mutate(across(where(is_fmt),format))

The following fields compose anyfmt column (though many can beNAif not calculated) :

  • display : name of the field to display, customisable for each cell(character)

  • n : raw count (integer)

  • wn : weighted count

  • pct : percentages

  • diff : differences from totals or reference cells

  • digits : digits to display, customisable for each cell (integer)

  • ctr : contributions of cells to variance (withcolor = "contrib")

  • mean : means (for numeric column variables)

  • var : variance (for numeric column variables ; Chi2 variance withpct)

  • ci : confidence intervals

  • rr : relative risks, needed to calculate odds ratio

  • or : odds ratios (or relative risks ratios)

  • in_totrow :TRUE if the cell is part of a total row,FALSEotherwise (logical)

  • in_tottab :TRUE if the cell is part of a total table,FALSEotherwise (logical)

  • in_refrow :TRUE if the cell is part of a reference row,FALSEotherwise (logical)

vctrs::vec_data(tabs$Married)#>       n display digits wn       pct      mean         diff ctr var ci rr or#> 1   932     pct      0 NA 0.4757529 1.0102402  0.004822432  NA  NA NA NA NA#> 2   869     pct      0 NA 0.2777245 0.5897357 -0.193205991  NA  NA NA NA NA#> 3  8316     pct      0 NA 0.5072278 1.0770757  0.036297310  NA  NA NA NA NA#> 4 10117     pct      0 NA 0.4709305 1.0000000  0.000000000  NA  NA NA NA NA#>   in_totrow in_tottab in_refrow#> 1     FALSE     FALSE     FALSE#> 2     FALSE     FALSE     FALSE#> 3     FALSE     FALSE     FALSE#> 4      TRUE     FALSE     FALSE

To get those underlying fields you can either usevctrs::fields or,more simply,$ :

tabs|> mutate(across(where(is_fmt),~vctrs::field(.,"pct") ))tabs$Married$pcttabs$Married$ntabs|> mutate(across(where(is_fmt),~.$n))

To change the field currently displayed, for the whole table or a singlevector, you can useset_display():

tabs|> set_display("diff")tabs|> mutate(across(where(is_fmt),~ set_display(.,"diff")))

To modify a field, you can usevctrsfield<-. For example, to changethe number of digits :

tab(forcats::gss_cat,race,marital,pct="row")|>     mutate(across(where(is_fmt),~vctrs::`field<-`(.,"digits", rep(2L, length(.)))))

Faster to write and easier to read, you can also usedplyr::mutate()on anfmt vector. For example, to create a new column with standardsdeviations and display it with decimals :

tab_num(data,race, c(age,tvhours),marital,digits=1L,comp="all")|>dplyr::mutate(dplyr::across(#Mutate over the whole table.    c(age,tvhours),~dplyr::mutate(.,#Mutate over each fmt vector's underlying data.frame.var= sqrt(var),display="var",digits=2L)|>       set_color("no"),.names="{.col}_sd"  ))

Some helper functions exists for total rows, total tables and referencerows (is_totrow() /as_totrow(),is_tottab() /as_tottab(),is_refrow() /as_refrow()) :

tab(data,race,marital,year,pct="row")|>dplyr::mutate(across(     where(is_fmt),~dplyr::if_else(is_totrow(.),true= mutate(.,digits=1L),false= mutate(.,digits=2L))  ))

Eachfmt column have attributes, which you can access or modify withget_ andset_ functions :

  • type /get_type() /set_type() : the type of thefmt vector,amongc("n", "mean", "row", "col", "all", "all_tabs") ; itdetermines which calculations are done withintab_ functions.

  • totcol /is_totcol() /as_totcol() :TRUE if the column is atotal column,FALSE otherwise (logical)

  • refcol /is_refcol() /as_refcol() :TRUE if the column is areference column for comparison,FALSE otherwise (logical)

  • color /get_color() /set_color() : the calculation to make toprint colors ; amongc("", "no", "diff", "diff_ci", "after_ci", "contrib")

  • col_var /get_col_var() /set_col_var() : the name of the columnvariable (there can be many in one single table)

  • comp_all /get_comp_all /set_comp_all() : when there aretab_vars, is the reference for comparison the subtable (FALSE), orthe total table (TRUE) ?

  • ref /get_ref_type() /set_diff_type() : the type of differencecalculated, either"no","tot" for totals, an index, or a regularexpression.

  • ci_type /get_ci_type() /set_ci_type() : the type of confidenceinterval, either"cell" or"diff"

For example, to print the number of observations of the total column :

tab(data,race,marital,year,pct="row")|>  mutate(across(where(is_totcol),~ mutate(.,display="n") ))

Note that, iftab_vars are provided, the table is grouped and alloperations are made within groups. To remove grouping (for example whenit gives errors), usedplyr::ungroup().

If you only need the simplest table, with only numeric counts (nofmt), or even a basedata.frame (not atibble) :

tab_plain(data,race,marital,num=TRUE)# counts as a numeric vectortab_plain(data,race,marital,df=TRUE)# same, with unique class = "data.frame"

About

User-Friendly Tables with Color Helpers for Data Exploration

Topics

Resources

License

Stars

Watchers

Forks

Contributors2

  •  
  •  

[8]ページ先頭

©2009-2025 Movatter.jp