# LSTbook

The companion R package for *Lessons in Statistical Thinking*
The `{LSTbook}` package provides software and datasets for *Lessons in Statistical Thinking*.
Version 0.6 of `{LSTbook}` was released on CRAN in December 2024. Note that previous versions did not include the `take_sample()` function, which is used extensively in *Lessons*. The CRAN version is also published for use with webr (as are most CRAN packages).
For more recent updates:

- Install the development version of `{LSTbook}` from GitHub with:

  ```r
  # install.packages("devtools")
  devtools::install_github("dtkaplan/LSTbook")
  ```

- Via r-universe.dev: https://dtkaplan.r-universe.dev/LSTbook
In the YAML for a webr document, refer to the repository under the `webr` index thus:

```yaml
webr:
  repos: ["https://dtkaplan.r-universe.dev"]
```
The `{LSTbook}` package has been developed to help students and instructors learn and teach statistics and early data science. `{LSTbook}` supports the 2024 textbook *Lessons in Statistical Thinking*, but instructors may want to use `{LSTbook}` even with other textbooks.
The statistics component of *Lessons* may fairly be called a radical innovation. As an introductory, university-level course, *Lessons* gives students access to important modern themes in statistics including modeling, simulation, co-variation, and causal inference. Data scientists, who use data to make genuine decisions, will get the tools they need. This includes a complete rethinking of statistical inference, starting with confidence intervals very early in the course, then gently introducing the structure of Bayesian inference. The coverage of hypothesis testing has greatly benefited from the discussions prompted by the American Statistical Association's *Statement on P-values* and is approached in a way that, I hope, will be appreciated by all sides of the debate.
The data-science part of the course includes the concepts and wrangling needed to undertake statistical investigations (not including data cleaning). It is based, as you might expect, on the tidyverse and `{dplyr}`.
Some readers may be familiar with the `{mosaic}` suite of packages which provides, for many students and instructors, their first framework for statistical computation. But there have been many R developments since 2011, when `{mosaic}` was introduced. These include pipes and the tidyverse style of referring to variables. `{mosaic}` has an uneasy equilibrium with the tidyverse. In contrast, the statistical functions in `{LSTbook}` fit in with the tidyverse style and mesh well with `{dplyr}` commands.
The `{LSTbook}` function set is highly streamlined and internally consistent. There is a tight set of only four object types produced by the `{LSTbook}` computations:

- Data frames
- Graphic frames (`{ggplot2}`-compatible but much streamlined)
- Models, which are summarized to produce either data frames or graphic frames
- Data simulations (via DAGs), which are sampled from to produce data frames
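As a sketch of how the four object types arise in practice (using the `Galton` data and the `sim_08` simulation object, both referred to later in this README), each comes out of a short pipeline:

```r
library(LSTbook)

# Data frame in, graphic frame out
Galton |> point_plot(height ~ mother)

# Model, summarized back down to a data frame of coefficients
Galton |> model_train(height ~ mother) |> conf_interval()

# Data simulation: sample from a DAG to produce a data frame
sim_08 |> take_sample(n = 10)
```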
Vignettes provide an instructor-level tutorial introduction to `{LSTbook}`. The student-facing introduction is the *Lessons in Statistical Thinking* textbook.
Every instructor of introductory statistics is familiar with textbooks that devote separate chapters to each of a half-dozen basic tests: means, differences in means, proportions, differences in proportions, and simple regression. It's been known for a century that these topics invoke the same statistical concepts. Moreover, they are merely precursors to the essential multivariable modeling techniques used in mainstream data-science tasks such as dealing with confounding.
To illustrate how `{LSTbook}` supports teaching such topics in a unified and streamlined way, consider two datasets provided by the `{mosaicData}` package: `Galton`, which contains the original data used by Francis Galton in the 1880s to study the heritability of genetic traits, specifically, human height; and `Whickham`, which results from a 20-year follow-up survey to study smoking and health.
Start by installing `{LSTbook}` as described above, then loading it into the R session:

```r
library(LSTbook)
```

In the examples that follow, we will use the `{LSTbook}` function `point_plot()`, which handles both numerical and categorical variables using one syntax. Here's a graphic for looking at the difference between two means.
```r
Galton |> point_plot(height ~ sex)
```
Point plots can be easily annotated with models. To illustrate the difference between the two means, add a model annotation:

```r
Galton |> point_plot(height ~ sex, annot = "model")
```
Other `point_plot()` annotations are `"violin"` and `"bw"`.
In *Lessons*, models are always graphed in the context of the underlying data and shown as confidence intervals.
The same graphics and modeling conventions apply to categorical variables:

```r
Whickham |> point_plot(outcome ~ smoker, annot = "model")
```
Simple regression works in the same way:

```r
Galton |> point_plot(height ~ mother, annot = "model")
Whickham |> point_plot(outcome ~ age, annot = "model")
```
The syntax extends naturally to handle the inclusion of covariates. For example, the simple calculation of the difference between two proportions is misleading; `age`, not smoking status, plays the primary role in explaining mortality.

```r
Whickham |> point_plot(outcome ~ age + smoker, annot = "model")
```
NOTE: To highlight statistical inference, we have been working with an n=200 sub-sample of `Galton` (100 from each sex):

```r
Galton <- Galton |> take_sample(n = 100, .by = sex)
```
Quantitative modeling has the same syntax, but rather than rely on the default R reports for models, `{LSTbook}` offers concise summaries.
```r
Whickham |> model_train(outcome ~ age + smoker) |> conf_interval()
#> Waiting for profiling to be done...
#> # A tibble: 3 × 4
#>   term          .lwr  .coef   .upr
#>   <chr>        <dbl>  <dbl>  <dbl>
#> 1 (Intercept) -8.50  -7.60  -6.77
#> 2 age          0.110  0.124  0.138
#> 3 smokerYes   -0.124  0.205  0.537
```
To help students develop a deeper appreciation of the importance of covariates, we can turn to data-generating simulations where we know the rules behind the data and can check whether modeling reveals them faithfully.
```r
print(sim_08)
#> Simulation object
#> ------------
#> [1] c <- rnorm(n)
#> [2] x <- c + rnorm(n)
#> [3] y <- x + c + 3 + rnorm(n)
dag_draw(sim_08)
```
From the rules, we can see that `y` increases directly with `x`, the coefficient being 1. A simple model gets this wrong:
```r
sim_08 |> take_sample(n = 100) |> model_train(y ~ x) |> conf_interval()
#> # A tibble: 2 × 4
#>   term         .lwr .coef  .upr
#>   <chr>       <dbl> <dbl> <dbl>
#> 1 (Intercept)  2.75  3.01  3.27
#> 2 x            1.14  1.32  1.51
```
I'll leave it as an exercise to the reader to see what happens when `c` is included in the model as a covariate.
Finally, an advanced example that's used as a demonstration but illustrates the flexibility of unifying modeling, simulation, and wrangling. We'll calculate the width of the `x` confidence interval as a function of the sample size `n`, averaging over 100 trials.
```r
sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 2, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr)
#>    .trial sample_size term     .lwr    .coef     .upr      width
#> 1       1         100    x 1.372233 1.560350 1.748467 0.37623423
#> 2       2         100    x 1.332790 1.490733 1.648677 0.31588668
#> 3       1         400    x 1.421812 1.505607 1.589403 0.16759077
#> 4       2         400    x 1.499239 1.580337 1.661436 0.16219628
#> 5       1        1600    x 1.424356 1.466122 1.507888 0.08353176
#> 6       2        1600    x 1.436348 1.477919 1.519491 0.08314316
#> 7       1        6400    x 1.474246 1.494887 1.515529 0.04128310
#> 8       2        6400    x 1.487190 1.508327 1.529465 0.04227480
#> 9       1       25600    x 1.495289 1.505822 1.516355 0.02106636
#> 10      2       25600    x 1.498624 1.509280 1.519937 0.02131267
```
I've used only two trials to show the output of `trials()`, but increase it to, say, `times = 100` and finish off the wrangling with the `{dplyr}` function `summarize(mean(width), .by = sample_size)`.
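Spelled out, that finishing pipeline looks like this (a sketch assembled from the pieces above; only the `times` argument and the final `summarize()` step differ from the earlier chunk):

```r
library(LSTbook)
library(dplyr)

sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 100, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr) |>
  summarize(mean(width), .by = sample_size)
```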
```r
#>   sample_size mean(width)
#> 1         100  0.34800368
#> 2         400  0.17059320
#> 3        1600  0.08483481
#> 4        6400  0.04251015
#> 5       25600  0.02123563
```