
Enhancing {ggplot2} plots with statistical analysis 📊📣

License

MIT (see LICENSE.md)

IndrajeetPatil/ggstatsplot


Raison d’être

“What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather … the revelation of the complex.” - Edward R. Tufte

{ggstatsplot} is an extension of the {ggplot2} package for creating graphics with details from statistical tests included in the information-rich plots themselves. In a typical exploratory data analysis workflow, data visualization and statistical modeling are two different phases: visualization informs modeling, and modeling in its turn can suggest a different visualization method, and so on and so forth. The central idea of {ggstatsplot} is simple: combine these two phases into one in the form of graphics with statistical details, which makes data exploration simpler and faster.

Installation

Type | Command
Release | install.packages("ggstatsplot")
Development | pak::pak("IndrajeetPatil/ggstatsplot")

Citation

If you want to cite this package in a scientific journal or in any other context, run the following code in your R console:

citation("ggstatsplot")
To cite package 'ggstatsplot' in publications use:

  Patil, I. (2021). Visualizations with statistical details: The
  'ggstatsplot' approach. Journal of Open Source Software, 6(61), 3167,
  doi:10.21105/joss.03167

A BibTeX entry for LaTeX users is

  @Article{,
    doi = {10.21105/joss.03167},
    url = {https://doi.org/10.21105/joss.03167},
    year = {2021},
    publisher = {{The Open Journal}},
    volume = {6},
    number = {61},
    pages = {3167},
    author = {Indrajeet Patil},
    title = {{Visualizations with statistical details: The {'ggstatsplot'} approach}},
    journal = {{Journal of Open Source Software}},
  }

Acknowledgments

I would like to thank all the contributors to {ggstatsplot} who pointed out bugs or requested features I hadn’t considered. I would especially like to thank other package developers (especially Daniel Lüdecke, Dominique Makowski, Mattan S. Ben-Shachar, Brenton Wiernik, Patrick Mair, Salvatore Mangiafico, etc.) who have patiently and diligently answered my relentless questions and supported feature requests in their projects. I also want to thank Chuck Powell for his initial contributions to the package.

The hex sticker was generously designed by Sarah Otterstetter (Max Planck Institute for Human Development, Berlin). This package has also benefited from the larger #rstats community on Twitter, LinkedIn, and StackOverflow.

Thanks are also due to my postdoc advisers (Mina Cikara and Fiery Cushman at Harvard University; Iyad Rahwan at Max Planck Institute for Human Development) who patiently supported me spending hundreds (?) of hours working on this package rather than what I was paid to do. 😁

Documentation and Examples

To see the detailed documentation for each function in the stable CRAN version of the package, see:

Summary of available plots

Function | Plot | Description
ggbetweenstats() | violin plots | for comparisons between groups/conditions
ggwithinstats() | violin plots | for comparisons within groups/conditions
gghistostats() | histograms | for the distribution of a numeric variable
ggdotplotstats() | dot plots/charts | for the distribution of a labeled numeric variable
ggscatterstats() | scatterplots | for correlation between two variables
ggcorrmat() | correlation matrices | for correlations between multiple variables
ggpiestats() | pie charts | for categorical data
ggbarstats() | bar charts | for categorical data
ggcoefstats() | dot-and-whisker plots | for regression models and meta-analysis

In addition to these basic plots, {ggstatsplot} also provides grouped_ versions (see below) that make it easy to repeat the same analysis for any grouping variable.

Summary of types of statistical analyses

The table below summarizes all the different types of analyses currently supported in this package:

Functions | Description | Parametric | Non-parametric | Robust | Bayesian
ggbetweenstats() | Between group/condition comparisons
ggwithinstats() | Within group/condition comparisons
gghistostats(), ggdotplotstats() | Distribution of a numeric variable
ggcorrmat() | Correlation matrix
ggscatterstats() | Correlation between two variables
ggpiestats(), ggbarstats() | Association between categorical variables
ggpiestats(), ggbarstats() | Equal proportions for categorical variable levels
ggcoefstats() | Regression model coefficients
ggcoefstats() | Random-effects meta-analysis
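
The statistical approach is selected with the type argument. As a minimal sketch (assuming {ggstatsplot} and its dependencies are installed; the types and plots names are purely illustrative), the same between-group comparison can be run under all four approaches by changing only that argument:

```r
library(ggstatsplot)

set.seed(123)

# the four accepted values of `type`, one per statistical approach
types <- c("parametric", "nonparametric", "robust", "bayes")

# build one plot per approach; only `type` changes between calls
plots <- lapply(types, function(tp) {
  ggbetweenstats(data = iris, x = Species, y = Sepal.Length, type = tp)
})
names(plots) <- types
```

Printing any element, for example plots$robust, draws the corresponding plot, so switching approaches does not require rewriting the rest of the call.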

Summary of Bayesian analysis

Analysis | Hypothesis testing | Estimation
(one/two-sample) t-test
one-way ANOVA
correlation
(one/two-way) contingency table
random-effects meta-analysis

Statistical reporting

For all statistical tests reported in the plots, the default template abides by the gold standard for statistical reporting. For example, here are results from Yuen’s test for trimmed means (robust t-test):

Summary of statistical tests and effect sizes

Statistical analysis is carried out by the {statsExpressions} package, and thus a summary table of all the statistical tests currently supported across various functions can be found in the article for that package: https://indrajeetpatil.github.io/statsExpressions/articles/stats_details.html

Primary functions

ggbetweenstats()

This function creates either a violin plot, a box plot, or a mix of the two for between-group or between-condition comparisons, with results from statistical tests in the subtitle. The simplest function call looks like this:

library(ggstatsplot)
set.seed(123)

ggbetweenstats(
  data  = iris,
  x     = Species,
  y     = Sepal.Length,
  title = "Distribution of sepal length across Iris species"
)

Defaults return

✅ raw data + distributions
✅ descriptive statistics
✅ inferential statistics
✅ effect size + CIs
✅ pairwise comparisons
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

A number of other arguments can be specified to make this plot even more informative or change some of the default options. Additionally, there is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable:

set.seed(123)

grouped_ggbetweenstats(
  data            = dplyr::filter(movies_long, genre %in% c("Action", "Comedy")),
  x               = mpaa,
  y               = length,
  grouping.var    = genre,
  ggsignif.args   = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni",
  palette         = "default_jama",
  package         = "ggsci",
  plotgrid.args   = list(nrow = 1),
  annotation.args = list(title = "Differences in movie length by mpaa ratings for different genres")
)
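
The p.adjust.method argument in the call above chooses the correction applied to the pairwise-comparison p-values. To give a feel for what such a correction does, here is a small base-R sketch using stats::p.adjust() on made-up p-values. The numbers are hypothetical, and the sketch illustrates the adjustment itself rather than the package's internal code path:

```r
# three hypothetical raw p-values from pairwise comparisons
p_raw <- c(0.01, 0.02, 0.03)

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
p_bonferroni <- p.adjust(p_raw, method = "bonferroni")

# Holm: step-down variant that is never more conservative than Bonferroni
p_holm <- p.adjust(p_raw, method = "holm")

# side-by-side view of raw and adjusted values
data.frame(p_raw, p_bonferroni, p_holm)
```

With three comparisons, Bonferroni triples every p-value, while Holm scales the smallest by three, the next by two, and so on, which is why it tends to retain more power.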

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggbetweenstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggbetweenstats.html

ggwithinstats()

The ggbetweenstats() function has a twin, ggwithinstats(), for repeated measures designs. It behaves in the same fashion, with a few minor tweaks to properly visualize the repeated measures design. As the example below shows, the only difference in plot structure is that the group means are now connected by paths to highlight the fact that these data are paired with each other.

set.seed(123)
library(WRS2) ## for data
library(afex) ## to run ANOVA

ggwithinstats(
  data  = WineTasting,
  x     = Wine,
  y     = Taste,
  title = "Wine tasting"
)

Defaults return

✅ raw data + distributions
✅ descriptive statistics
✅ inferential statistics
✅ effect size + CIs
✅ pairwise comparisons
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

As with ggbetweenstats(), this function also has a grouped_ variant that makes repeating the same analysis across a single grouping variable quicker. Here is an example with only repeated measurements:

set.seed(123)

grouped_ggwithinstats(
  data = dplyr::filter(
    bugs_long,
    region %in% c("Europe", "North America"),
    condition %in% c("LDLF", "LDHF")
  ),
  x            = condition,
  y            = desire,
  type         = "np",
  xlab         = "Condition",
  ylab         = "Desire to kill an arthropod",
  grouping.var = region
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggwithinstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggwithinstats.html

gghistostats()

To visualize the distribution of a single variable and check if its mean is significantly different from a specified value with a one-sample test, gghistostats() can be used.

set.seed(123)

gghistostats(
  data       = ggplot2::msleep,
  x          = awake,
  title      = "Amount of time spent awake",
  test.value = 12,
  binwidth   = 1
)
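
For intuition about the one-sample test mentioned above, the parametric comparison against test.value can be approximated in base R with stats::t.test(). This is a sketch of the underlying idea under the assumption that a Student's t-test is the relevant parametric test; it is not the package's own code, and the awake and t_res names are illustrative.

```r
# same variable and reference value as in the gghistostats() call
awake <- ggplot2::msleep$awake

# one-sample t-test of the null hypothesis mean(awake) == 12
t_res <- stats::t.test(awake, mu = 12)

t_res$statistic # t value
t_res$parameter # degrees of freedom (non-missing n - 1)
t_res$p.value   # two-sided p-value
```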

Defaults return

✅ counts + proportion for bins
✅ descriptive statistics
✅ inferential statistics
✅ effect size + CIs
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

There is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable:

set.seed(123)

grouped_gghistostats(
  data            = dplyr::filter(movies_long, genre %in% c("Action", "Comedy")),
  x               = budget,
  test.value      = 50,
  type            = "nonparametric",
  xlab            = "Movies budget (in million US$)",
  grouping.var    = genre,
  ggtheme         = ggthemes::theme_tufte(),
  ## modify the defaults from `{ggstatsplot}` for each plot
  plotgrid.args   = list(nrow = 1),
  annotation.args = list(title = "Movies budgets for different genres")
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/gghistostats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/gghistostats.html

ggdotplotstats()

This function is similar to gghistostats(), but is intended to be used when the numeric variable also has a label.

set.seed(123)

ggdotplotstats(
  data       = dplyr::filter(gapminder::gapminder, continent == "Asia"),
  y          = country,
  x          = lifeExp,
  test.value = 55,
  type       = "robust",
  title      = "Distribution of life expectancy in Asian continent",
  xlab       = "Life expectancy"
)

Defaults return

✅ descriptives (mean + sample size)
✅ inferential statistics
✅ effect size + CIs
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

As with the rest of the functions in this package, there is also a grouped_ variant of this function to facilitate looping the same operation for all levels of a single grouping variable.

set.seed(123)

grouped_ggdotplotstats(
  data            = dplyr::filter(ggplot2::mpg, cyl %in% c("4", "6")),
  x               = cty,
  y               = manufacturer,
  type            = "bayes",
  xlab            = "city miles per gallon",
  ylab            = "car manufacturer",
  grouping.var    = cyl,
  test.value      = 15.5,
  point.args      = list(color = "red", size = 5, shape = 13),
  annotation.args = list(title = "Fuel economy data")
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggdotplotstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggdotplotstats.html

ggscatterstats()

This function creates a scatterplot with marginal distributions overlaid on the axes and results from statistical tests in the subtitle:

ggscatterstats(
  data  = ggplot2::msleep,
  x     = sleep_rem,
  y     = awake,
  xlab  = "REM sleep (in hours)",
  ylab  = "Amount of time spent awake (in hours)",
  title = "Understanding mammalian sleep"
)

Defaults return

✅ raw data + distributions
✅ marginal distributions
✅ inferential statistics
✅ effect size + CIs
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

There is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable.

set.seed(123)

grouped_ggscatterstats(
  data             = dplyr::filter(movies_long, genre %in% c("Action", "Comedy")),
  x                = rating,
  y                = length,
  grouping.var     = genre,
  label.var        = title,
  label.expression = length > 200,
  xlab             = "IMDB rating",
  ggtheme          = ggplot2::theme_grey(),
  ggplot.component = list(ggplot2::scale_x_continuous(breaks = seq(2, 9, 1), limits = (c(2, 9)))),
  plotgrid.args    = list(nrow = 1),
  annotation.args  = list(title = "Relationship between movie length and IMDB ratings")
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggscatterstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggscatterstats.html

ggcorrmat()

ggcorrmat() makes a correlogram (a matrix of correlation coefficients) with a minimal amount of code. The defaults alone produce publication-ready correlation matrices, but for the sake of exploring the available options, let’s change some of them. For example, multiple aesthetics-related arguments can be modified to change the appearance of the correlation matrix.

set.seed(123)

## as a default this function outputs a correlation matrix plot
ggcorrmat(
  data     = ggplot2::msleep,
  colors   = c("#B2182B", "white", "#4D4D4D"),
  title    = "Correlogram for mammals sleep dataset",
  subtitle = "sleep units: hours; weight units: kilograms"
)

Defaults return

✅ effect size + significance
✅ careful handling of NAs

If there are NAs present in the selected variables, the legend will display the minimum, median, and maximum number of pairs used for correlation tests.
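
To see why the number of pairs can differ from cell to cell, the sketch below counts pairwise-complete observations for the numeric columns of ggplot2::msleep in base R. It assumes correlations are computed on pairwise-complete cases, as the legend description above suggests, and it is an illustration rather than the package's implementation; all object names are made up for the example.

```r
# numeric columns of the dataset used above
num <- Filter(is.numeric, as.data.frame(ggplot2::msleep))

# 1 where a value is present, 0 where it is NA
present <- (!is.na(as.matrix(num))) * 1

# entry [i, j] = number of rows where variables i and j are both observed
n_pairs <- crossprod(present)

# summary over distinct variable pairs (upper triangle, diagonal excluded)
pairs <- n_pairs[upper.tri(n_pairs)]
c(min = min(pairs), median = stats::median(pairs), max = max(pairs))

# the matching correlation matrix on pairwise-complete observations
round(stats::cor(num, use = "pairwise.complete.obs"), 2)
```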

There is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable:

set.seed(123)

grouped_ggcorrmat(
  data         = dplyr::filter(movies_long, genre %in% c("Action", "Comedy")),
  type         = "robust",
  colors       = c("#cbac43", "white", "#550000"),
  grouping.var = genre,
  matrix.type  = "lower"
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggcorrmat.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggcorrmat.html

ggpiestats()

This function creates a pie chart for categorical or nominal variables with results from contingency table analysis (Pearson’s chi-squared test for between-subjects design and McNemar’s chi-squared test for within-subjects design) included in the subtitle of the plot. If only one categorical variable is entered, results from a one-sample proportion test (i.e., a chi-squared goodness of fit test) will be displayed as a subtitle.
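
As a rough illustration of the one-variable case, the chi-squared goodness-of-fit test can be run directly in base R. The sketch below tests whether the three cylinder counts in mtcars are equally frequent; it shows the test named above, not the package's internal code, and the counts and gof names are illustrative.

```r
# observed counts for each level of a single categorical variable
counts <- table(mtcars$cyl)

# goodness-of-fit test against equal expected proportions (the default)
gof <- stats::chisq.test(counts)

gof$statistic # chi-squared value
gof$parameter # degrees of freedom (number of levels - 1)
gof$p.value
```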

To study an interaction between two categorical variables:

set.seed(123)

ggpiestats(
  data         = mtcars,
  x            = am,
  y            = cyl,
  package      = "wesanderson",
  palette      = "Royal1",
  title        = "Dataset: Motor Trend Car Road Tests",
  legend.title = "Transmission"
)

Defaults return

✅ descriptives (frequency + %s)
✅ inferential statistics
✅ effect size + CIs
✅ Goodness-of-fit tests
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

There is also a grouped_ variant of this function that makes it easy to repeat the same operation across a single grouping variable. The following example is a case where the theoretical question is about proportions for different levels of a single nominal variable:

set.seed(123)

grouped_ggpiestats(
  data         = mtcars,
  x            = cyl,
  grouping.var = am,
  label.repel  = TRUE,
  package      = "ggsci",
  palette      = "default_ucscgb"
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggpiestats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggpiestats.html

ggbarstats()

In case you are not a fan of pie charts (for very good reasons), you can alternatively use the ggbarstats() function, which has a similar syntax.

N.B. The p-values from the one-sample proportion test are displayed on top of each bar.

set.seed(123)
library(ggplot2)

ggbarstats(
  data             = movies_long,
  x                = mpaa,
  y                = genre,
  title            = "MPAA Ratings by Genre",
  xlab             = "movie genre",
  legend.title     = "MPAA rating",
  ggplot.component = list(ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2))),
  palette          = "Set2"
)

Defaults return

✅ descriptives (frequency + %s)
✅ inferential statistics
✅ effect size + CIs
✅ Goodness-of-fit tests
✅ Bayesian hypothesis-testing
✅ Bayesian estimation

And, needless to say, there is also a grouped_ variant of this function:

## setup
set.seed(123)

grouped_ggbarstats(
  data         = mtcars,
  x            = am,
  y            = cyl,
  grouping.var = vs,
  package      = "wesanderson",
  palette      = "Darjeeling2" # ,
  # ggtheme      = ggthemes::theme_tufte(base_size = 12)
)

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggbarstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggbarstats.html

ggcoefstats()

The function ggcoefstats() generates dot-and-whisker plots for regression models. The tidy data frames are prepared using parameters::model_parameters(). Additionally, if available, the model summary indices are also extracted from performance::model_performance().

set.seed(123)

## model
mod <- stats::lm(formula = mpg ~ am * cyl, data = mtcars)

ggcoefstats(mod)

Defaults return

✅ inferential statistics
✅ estimate + CIs
✅ model summary (AIC and BIC)
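
To look at the data behind such a plot, the two helper functions named above can be called on the same model directly. This is a minimal sketch assuming the {parameters} and {performance} packages are installed; tidy_df and perf are illustrative names, and the exact columns returned may vary by package version.

```r
# same model as in the ggcoefstats() example
mod <- stats::lm(formula = mpg ~ am * cyl, data = mtcars)

# one row per coefficient: estimate, confidence interval, test statistic, p-value
tidy_df <- parameters::model_parameters(mod)

# one-row summary of model-level indices, including AIC and BIC
perf <- performance::model_performance(mod)

tidy_df
perf
```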

Details about underlying functions used to create graphics and statistical tests carried out can be found in the function documentation: https://indrajeetpatil.github.io/ggstatsplot/reference/ggcoefstats.html

For more, also read the following vignette: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggcoefstats.html

Extracting expressions and data frames with statistical details

{ggstatsplot} also offers convenience functions to extract the expressions displayed in {ggstatsplot} plots and the data frames with statistical details used to create them.

set.seed(123)

p <- ggbetweenstats(mtcars, cyl, mpg)

# extracting expression present in the subtitle
extract_subtitle(p)
#> list(italic("F")["Welch"](2, 18.03) == "31.62", italic(p) ==
#>     "1.27e-06", widehat(omega["p"]^2) == "0.74", CI["95%"] ~
#>     "[" * "0.53", "1.00" * "]", italic("n")["obs"] == "32")

# extracting expression present in the caption
extract_caption(p)
#> list(log[e] * (BF["01"]) == "-14.92", widehat(italic(R^"2"))["Bayesian"]^"posterior" ==
#>     "0.71", CI["95%"]^HDI ~ "[" * "0.57", "0.79" * "]", italic("r")["Cauchy"]^"JZS" ==
#>     "0.71")

# a list of tibbles containing statistical analysis summaries
extract_stats(p)
#> $subtitle_data
#> # A tibble: 1 × 14
#>   statistic    df df.error    p.value
#>       <dbl> <dbl>    <dbl>      <dbl>
#> 1      31.6     2     18.0 0.00000127
#>   method                                                   effectsize estimate
#>   <chr>                                                    <chr>         <dbl>
#> 1 One-way analysis of means (not assuming equal variances) Omega2        0.744
#>   conf.level conf.low conf.high conf.method conf.distribution n.obs expression
#>        <dbl>    <dbl>     <dbl> <chr>       <chr>             <int> <list>
#> 1       0.95    0.531         1 ncp         F                    32 <language>
#>
#> $caption_data
#> # A tibble: 6 × 17
#>   term     pd prior.distribution prior.location prior.scale     bf10
#>   <chr> <dbl> <chr>                       <dbl>       <dbl>    <dbl>
#> 1 mu    1     cauchy                          0       0.707 3008850.
#> 2 cyl-4 1     cauchy                          0       0.707 3008850.
#> 3 cyl-6 0.780 cauchy                          0       0.707 3008850.
#> 4 cyl-8 1     cauchy                          0       0.707 3008850.
#> 5 sig2  1     cauchy                          0       0.707 3008850.
#> 6 g_cyl 1     cauchy                          0       0.707 3008850.
#>   method                          log_e_bf10 effectsize         estimate std.dev
#>   <chr>                                <dbl> <chr>                 <dbl>   <dbl>
#> 1 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#> 2 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#> 3 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#> 4 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#> 5 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#> 6 Bayes factors for linear models       14.9 Bayesian R-squared    0.714  0.0503
#>   conf.level conf.low conf.high conf.method n.obs expression
#>        <dbl>    <dbl>     <dbl> <chr>       <int> <list>
#> 1       0.95    0.574     0.788 HDI            32 <language>
#> 2       0.95    0.574     0.788 HDI            32 <language>
#> 3       0.95    0.574     0.788 HDI            32 <language>
#> 4       0.95    0.574     0.788 HDI            32 <language>
#> 5       0.95    0.574     0.788 HDI            32 <language>
#> 6       0.95    0.574     0.788 HDI            32 <language>
#>
#> $pairwise_comparisons_data
#> # A tibble: 3 × 9
#>   group1 group2 statistic   p.value alternative distribution p.adjust.method
#>   <chr>  <chr>      <dbl>     <dbl> <chr>       <chr>        <chr>
#> 1 4      6          -6.67 0.00110   two.sided   q            Holm
#> 2 4      8         -10.7  0.0000140 two.sided   q            Holm
#> 3 6      8          -7.48 0.000257  two.sided   q            Holm
#>   test         expression
#>   <chr>        <list>
#> 1 Games-Howell <language>
#> 2 Games-Howell <language>
#> 3 Games-Howell <language>
#>
#> $descriptive_data
#> NULL
#>
#> $one_sample_data
#> NULL
#>
#> $tidy_data
#> NULL
#>
#> $glance_data
#> NULL
#>
#> attr(,"class")
#> [1] "ggstatsplot_stats" "list"

Note that all of this analysis is carried out by the {statsExpressions} package: https://indrajeetpatil.github.io/statsExpressions/

Using {ggstatsplot} statistical details with custom plots

Sometimes you may not like the default plots produced by {ggstatsplot}. In such cases, you can use other custom plots (from {ggplot2} or other plotting packages) and still use {ggstatsplot} functions to display results from the relevant statistical test.

For example, in the following chunk, we will create our own plot using the {ggplot2} package, and use a {ggstatsplot} function for extracting the expression:

## loading the needed libraries
set.seed(123)
library(ggplot2)
library(ggstatsplot)

## using `{ggstatsplot}` to get expression with statistical results
## (the native pipe, available from R 4.1, avoids relying on `%>%` being attached)
stats_results <- ggbetweenstats(morley, Expt, Speed) |> extract_subtitle()

## creating a custom plot of our choosing
ggplot(morley, aes(x = as.factor(Expt), y = Speed)) +
  geom_boxplot() +
  labs(
    title    = "Michelson-Morley experiments",
    subtitle = stats_results,
    x        = "Experiment number",
    y        = "Speed of light"
  )

Summary of benefits of using {ggstatsplot}

  • No need to use scores of packages for statistical analysis (e.g., one to get stats, one to get effect sizes, another to get Bayes Factors, and yet another to get pairwise comparisons, etc.).

  • Minimal amount of code needed for all functions (typically only data, x, and y), which minimizes chances of error and makes for tidy scripts.

  • Conveniently toggle between statistical approaches.

  • Truly makes your figures worth a thousand words.

  • No need to copy-paste results to the text editor (e.g., MS Word).

  • Disembodied figures stand on their own and are easy to evaluate for the reader.

  • More breathing room for theoretical discussion and other text.

  • No need to worry about updating figures and statistical details separately.

Misconceptions about {ggstatsplot}

This package is…

❌ an alternative to learning {ggplot2}
✅ (The better you know {ggplot2}, the more you can modify the defaults to your liking.)

❌ meant to be used in talks/presentations
✅ (Default plots can be too complicated for effectively communicating results in time-constrained presentation settings, e.g. conference talks.)

❌ the only game in town
✅ (GUI software alternatives: JASP and jamovi).

Extensions

In case you use the GUI software jamovi, you can install a module called jjstatsplot, which is a wrapper around {ggstatsplot}.

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the GitHub issues system over trying to reach out to me in other ways (personal e-mail, Twitter, etc.). Pull Requests for contributions are encouraged.

Here are some simple ways in which you can contribute (in increasing order of commitment):

  • Read and correct any inconsistencies in the documentation
  • Raise issues about bugs or wanted features
  • Review code
  • Add new functionality (in the form of new plotting functions or helpers for preparing subtitles)

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

