dtkaplan/LSTbook

The companion R package for *Lessons in Statistical Thinking*

The {LSTbook} package provides software and datasets for *Lessons in Statistical Thinking*.

Installation

Version 0.6 of {LSTbook} was released on CRAN in December 2024. Note that previous versions did not include the take_sample() function, which is used extensively in *Lessons*. The CRAN version is also published for use with webr (as are most CRAN packages).
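To install the released version from CRAN, the standard call suffices:

install.packages("LSTbook")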

For more recent updates:

  • Install the development version of {LSTbook} from GitHub with:

# install.packages("devtools")
devtools::install_github("dtkaplan/LSTbook")

Overview

The {LSTbook} package has been developed to help students and instructors learn and teach statistics and early data science. {LSTbook} supports the 2024 textbook *Lessons in Statistical Thinking*, but instructors may want to use {LSTbook} even with other textbooks.

The statistics component of *Lessons* may fairly be called a radical innovation. As an introductory, university-level course, *Lessons* gives students access to important modern themes in statistics including modeling, simulation, co-variation, and causal inference. Data scientists, who use data to make genuine decisions, will get the tools they need. This includes a complete rethinking of statistical inference, starting with confidence intervals very early in the course, then gently introducing the structure of Bayesian inference. The coverage of hypothesis testing has greatly benefited from the discussions prompted by the American Statistical Association’s *Statement on P-values* and is approached in a way that, I hope, will be appreciated by all sides of the debate.

The data-science part of the course includes the concepts and wrangling needed to undertake statistical investigations (not including data cleaning). It is based, as you might expect, on the tidyverse and {dplyr}.
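As a rough sketch of the wrangling style involved (plain {dplyr} verbs applied to the Galton data introduced below; midparent is just an illustrative derived variable, not something the package defines):

library(dplyr)
library(mosaicData)  # provides the Galton data frame

Galton |>
  filter(sex == "F") |>                          # keep the daughters
  mutate(midparent = (father + mother) / 2) |>   # illustrative derived variable
  head()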

Some readers may be familiar with the {mosaic} suite of packages which provides, for many students and instructors, their first framework for statistical computation. But there have been many R developments since 2011, when {mosaic} was introduced. These include pipes and the tidyverse style of referring to variables. {mosaic} has an uneasy equilibrium with the tidyverse. In contrast, the statistical functions in {LSTbook} fit in with the tidyverse style and mesh well with {dplyr} commands.
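To make the contrast concrete, here is a sketch of the two styles side by side (the first call relies on {mosaic}'s formula interface masking base mean(); the second is the pipeline style that {LSTbook} functions are designed to sit alongside):

# {mosaic} style: a formula interface, outside any pipeline
library(mosaic)
library(mosaicData)  # for the Galton data
mean(height ~ sex, data = Galton)

# tidyverse/pipeline style, which {LSTbook} meshes with
library(dplyr)
Galton |> summarize(mean(height), .by = sex)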

The {LSTbook} function set is highly streamlined and internally consistent. There is a tight set of only four object types produced by the {LSTbook} computations (a short sketch follows the list):

  • Data frames
  • Graphic frames ({ggplot2} compatible but much streamlined)
  • Models, which are summarized to produce either data frames or graphic frames
  • Data simulations (via DAGs), which are sampled from to produce data frames
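For orientation, a minimal sketch touching each type in turn, using functions and the sim_08 simulation object demonstrated later in this README (output not shown):

Galton                                # 1. a data frame
Galton |> point_plot(height ~ sex)    # 2. a graphic frame
Galton |> model_train(height ~ sex)   # 3. a model
sim_08 |> take_sample(n = 5)          # 4. a simulation, sampled into a data frame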

Vignettes provide an instructor-level tutorial introduction to {LSTbook}. The student-facing introduction is the *Lessons in Statistical Thinking* textbook.
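Once the package is installed, the vignettes can be browsed with base R:

browseVignettes("LSTbook")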

Statistics for data science

Every instructor of introductory statistics is familiar with textbooks that devote separate chapters to each of a half-dozen basic tests: means, differences in means, proportions, differences in proportions, and simple regression. It’s been known for a century that these topics invoke the same statistical concepts. Moreover, they are merely precursors to the essential multivariable modeling techniques used in mainstream data-science tasks such as dealing with confounding.

To illustrate how {LSTbook} supports teaching such topics in a unified and streamlined way, consider two datasets provided by the {mosaicData} package: Galton, which contains the original data used by Francis Galton in the 1880s to study the heritability of genetic traits, specifically human height; and Whickham, the results of a 20-year follow-up survey to study smoking and health.

Start by installing {LSTbook} as described above, then loading it into the R session:

library(LSTbook)

In the examples that follow, we will use the {LSTbook} function point_plot(), which handles both numerical and categorical variables using one syntax. Here’s a graphic for looking at the difference between two means.

Galton |> point_plot(height ~ sex)

Point plots can be easily annotated with models. To illustrate the difference between the two means, add a model annotation:

Galton |> point_plot(height ~ sex, annot = "model")

Other point_plot() annotations are "violin" and "bw".
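Either annotation drops into the same slot as "model":

Galton |> point_plot(height ~ sex, annot = "violin")
Galton |> point_plot(height ~ sex, annot = "bw")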

In *Lessons*, models are always graphed in the context of the underlying data and shown as confidence intervals.

The same graphics and modeling conventions apply to categorical variables:

Whickham |> point_plot(outcome ~ smoker, annot = "model")

Simple regression works in the same way:

Galton |> point_plot(height ~ mother, annot = "model")

Whickham |> point_plot(outcome ~ age, annot = "model")

The syntax extends naturally to handle the inclusion of covariates. For example, the simple calculation of the difference between two proportions is misleading; age, not smoking status, plays the primary role in explaining mortality.

Whickham |> point_plot(outcome ~ age + smoker, annot = "model")

NOTE: To highlight statistical inference, we have been working with an n = 200 sub-sample of Galton:

Galton <- Galton |> take_sample(n = 100, .by = sex)

Quantitative modeling has the same syntax, but rather than rely on the default R reports for models, {LSTbook} offers concise summaries.

Whickham |> model_train(outcome ~ age + smoker) |> conf_interval()
#> Waiting for profiling to be done...
#> # A tibble: 3 × 4
#>   term          .lwr  .coef   .upr
#>   <chr>        <dbl>  <dbl>  <dbl>
#> 1 (Intercept) -8.50  -7.60  -6.77
#> 2 age          0.110  0.124  0.138
#> 3 smokerYes   -0.124  0.205  0.537

To help students develop a deeper appreciation of the importance of covariates, we can turn to data-generating simulations, where we know the rules behind the data and can check whether modeling reveals them faithfully.

print(sim_08)
#> Simulation object
#> ------------
#> [1] c <- rnorm(n)
#> [2] x <- c + rnorm(n)
#> [3] y <- x + c + 3 + rnorm(n)

dag_draw(sim_08)

From the rules, we can see that y increases directly with x, the coefficient being 1. A simple model gets this wrong:

sim_08 |>
  take_sample(n = 100) |>
  model_train(y ~ x) |>
  conf_interval()
#> # A tibble: 2 × 4
#>   term         .lwr .coef  .upr
#>   <chr>       <dbl> <dbl> <dbl>
#> 1 (Intercept)  2.75  3.01  3.27
#> 2 x            1.14  1.32  1.51

I’ll leave it as an exercise to the reader to see what happens when c is included in the model as a covariate.
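For reference, the covariate version is a one-term change to the model formula (output withheld, since checking it is the exercise):

sim_08 |>
  take_sample(n = 100) |>
  model_train(y ~ x + c) |>
  conf_interval()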

Finally, an advanced example that’s used as a demonstration but illustrates the flexibility of unifying modeling, simulation, and wrangling. We’ll calculate the width of the x confidence interval as a function of the sample size n, averaging over 100 trials.

sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 2, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr)
#>    .trial sample_size term     .lwr    .coef     .upr      width
#> 1       1         100    x 1.372233 1.560350 1.748467 0.37623423
#> 2       2         100    x 1.332790 1.490733 1.648677 0.31588668
#> 3       1         400    x 1.421812 1.505607 1.589403 0.16759077
#> 4       2         400    x 1.499239 1.580337 1.661436 0.16219628
#> 5       1        1600    x 1.424356 1.466122 1.507888 0.08353176
#> 6       2        1600    x 1.436348 1.477919 1.519491 0.08314316
#> 7       1        6400    x 1.474246 1.494887 1.515529 0.04128310
#> 8       2        6400    x 1.487190 1.508327 1.529465 0.04227480
#> 9       1       25600    x 1.495289 1.505822 1.516355 0.02106636
#> 10      2       25600    x 1.498624 1.509280 1.519937 0.02131267

I’ve used only two trials to show the output of trials(), but increase it to, say, times = 100 and finish off the wrangling with the {dplyr} function summarize(mean(width), .by = sample_size).
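Assembled, that pipeline reads as follows (a sketch; it should yield a table like the one shown below):

sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 100, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr) |>
  summarize(mean(width), .by = sample_size)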

#>   sample_size mean(width)
#> 1         100  0.34800368
#> 2         400  0.17059320
#> 3        1600  0.08483481
#> 4        6400  0.04251015
#> 5       25600  0.02123563
