dtkaplan/LSTbook

The companion R package for *Lessons in Statistical Thinking*

The {LSTbook} package provides software and datasets for *Lessons in Statistical Thinking*.

Installation

Version 0.6 of {LSTbook} was released on CRAN in December 2024. Note that previous versions did not include the take_sample() function, which is used extensively in *Lessons*. The CRAN version is also published for use with webr (as are most CRAN packages).
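To install the released version from CRAN, the standard call suffices:

install.packages("LSTbook")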

For more recent updates:

  • Install the development version of {LSTbook} from GitHub with:

# install.packages("devtools")
devtools::install_github("dtkaplan/LSTbook")

Overview

The {LSTbook} package has been developed to help students and instructors learn and teach statistics and early data science. {LSTbook} supports the 2024 textbook *Lessons in Statistical Thinking*, but instructors may want to use {LSTbook} even with other textbooks.

The statistics component of *Lessons* may fairly be called a radical innovation. As an introductory, university-level course, *Lessons* gives students access to important modern themes in statistics including modeling, simulation, co-variation, and causal inference. Data scientists, who use data to make genuine decisions, will get the tools they need. This includes a complete rethinking of statistical inference, starting with confidence intervals very early in the course, then gently introducing the structure of Bayesian inference. The coverage of hypothesis testing has greatly benefited from the discussions prompted by the American Statistical Association’s *Statement on P-values* and is approached in a way that, I hope, will be appreciated by all sides of the debate.

The data-science part of the course includes the concepts and wrangling needed to undertake statistical investigations (not including data cleaning). It is based, as you might expect, on the tidyverse and {dplyr}.
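As a rough sketch of the wrangling style involved (plain {dplyr} verbs applied to the Galton data introduced below; midparent is just an illustrative derived variable, not something the package defines):

library(dplyr)
library(mosaicData)  # provides the Galton data frame

Galton |>
  filter(sex == "F") |>                          # keep the daughters
  mutate(midparent = (father + mother) / 2) |>   # illustrative derived variable
  head()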

Some readers may be familiar with the {mosaic} suite of packages which provides, for many students and instructors, their first framework for statistical computation. But there have been many R developments since 2011, when {mosaic} was introduced. These include pipes and the tidyverse style of referring to variables. {mosaic} has an uneasy equilibrium with the tidyverse. In contrast, the statistical functions in {LSTbook} fit in with the tidyverse style and mesh well with {dplyr} commands.
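To make the contrast concrete, here is a sketch of the two styles side by side (the first call relies on {mosaic}'s formula interface masking base mean(); the second is the pipeline style that {LSTbook} functions are designed to sit alongside):

# {mosaic} style: a formula interface, outside any pipeline
library(mosaic)
library(mosaicData)  # for the Galton data
mean(height ~ sex, data = Galton)

# tidyverse/pipeline style, which {LSTbook} meshes with
library(dplyr)
Galton |> summarize(mean(height), .by = sex)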

The {LSTbook} function set is highly streamlined and internally consistent. There is a tight set of only four object types produced by the {LSTbook} computations (a short sketch follows the list):

  • Data frames
  • Graphic frames ({ggplot2} compatible but much streamlined)
  • Models, which are summarized to produce either data frames or graphic frames
  • Data simulations (via DAGs), which are sampled from to produce data frames
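For orientation, a minimal sketch touching each type in turn, using functions and the sim_08 simulation object demonstrated later in this README (output not shown):

Galton                                # 1. a data frame
Galton |> point_plot(height ~ sex)    # 2. a graphic frame
Galton |> model_train(height ~ sex)   # 3. a model
sim_08 |> take_sample(n = 5)          # 4. a simulation, sampled into a data frame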

Vignettes provide an instructor-level tutorial introduction to {LSTbook}. The student-facing introduction is the *Lessons in Statistical Thinking* textbook.
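Once the package is installed, the vignettes can be browsed with base R:

browseVignettes("LSTbook")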

Statistics for data science

Every instructor of introductory statistics is familiar with textbooks that devote separate chapters to each of a half-dozen basic tests: means, differences in means, proportions, differences in proportions, and simple regression. It’s been known for a century that these topics invoke the same statistical concepts. Moreover, they are merely precursors to the essential multivariable modeling techniques used in mainstream data-science tasks such as dealing with confounding.

To illustrate how {LSTbook} supports teaching such topics in a unified and streamlined way, consider two datasets provided by the {mosaicData} package: Galton, which contains the original data used by Francis Galton in the 1880s to study the heritability of genetic traits, specifically human height; and Whickham, the results of a 20-year follow-up survey to study smoking and health.

Start by installing {LSTbook} as described above, then loading it into the R session:

library(LSTbook)

In the examples that follow, we will use the {LSTbook} function point_plot(), which handles both numerical and categorical variables using one syntax. Here’s a graphic for looking at the difference between two means.

Galton |> point_plot(height ~ sex)

Point plots can be easily annotated with models. To illustrate the difference between the two means, add a model annotation:

Galton |> point_plot(height ~ sex, annot = "model")

Other point_plot() annotations are "violin" and "bw".
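Either annotation drops into the same slot as "model":

Galton |> point_plot(height ~ sex, annot = "violin")
Galton |> point_plot(height ~ sex, annot = "bw")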

In *Lessons*, models are always graphed in the context of the underlying data and shown as confidence intervals.

The same graphics and modeling conventions apply to categorical variables:

Whickham |> point_plot(outcome ~ smoker, annot = "model")

Simple regression works in the same way:

Galton |> point_plot(height ~ mother, annot = "model")

Whickham |> point_plot(outcome ~ age, annot = "model")

The syntax extends naturally to handle the inclusion of covariates. For example, the simple calculation of the difference between two proportions is misleading; age, not smoking status, plays the primary role in explaining mortality.

Whickham |> point_plot(outcome ~ age + smoker, annot = "model")

NOTE: To highlight statistical inference, we have been working with an n = 200 sub-sample of Galton:

Galton <- Galton |> take_sample(n = 100, .by = sex)

Quantitative modeling has the same syntax, but rather than rely on the default R reports for models, {LSTbook} offers concise summaries.

Whickham |> model_train(outcome ~ age + smoker) |> conf_interval()
#> Waiting for profiling to be done...
#> # A tibble: 3 × 4
#>   term          .lwr  .coef   .upr
#>   <chr>        <dbl>  <dbl>  <dbl>
#> 1 (Intercept) -8.50  -7.60  -6.77
#> 2 age          0.110  0.124  0.138
#> 3 smokerYes   -0.124  0.205  0.537

To help students develop a deeper appreciation of the importance of covariates, we can turn to data-generating simulations, where we know the rules behind the data and can check whether modeling reveals them faithfully.

print(sim_08)
#> Simulation object
#> ------------
#> [1] c <- rnorm(n)
#> [2] x <- c + rnorm(n)
#> [3] y <- x + c + 3 + rnorm(n)

dag_draw(sim_08)

From the rules, we can see that y increases directly with x, the coefficient being 1. A simple model gets this wrong:

sim_08 |>
  take_sample(n = 100) |>
  model_train(y ~ x) |>
  conf_interval()
#> # A tibble: 2 × 4
#>   term         .lwr .coef  .upr
#>   <chr>       <dbl> <dbl> <dbl>
#> 1 (Intercept)  2.75  3.01  3.27
#> 2 x            1.14  1.32  1.51

I’ll leave it as an exercise to the reader to see what happens when c is included in the model as a covariate.
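For reference, the covariate version is a one-term change to the model formula (output withheld, since checking it is the exercise):

sim_08 |>
  take_sample(n = 100) |>
  model_train(y ~ x + c) |>
  conf_interval()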

Finally, an advanced example that’s used as a demonstration but illustrates the flexibility of unifying modeling, simulation, and wrangling. We’ll calculate the width of the x confidence interval as a function of the sample size n, averaging over 100 trials.

sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 2, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr)
#>    .trial sample_size term     .lwr    .coef     .upr      width
#> 1       1         100    x 1.372233 1.560350 1.748467 0.37623423
#> 2       2         100    x 1.332790 1.490733 1.648677 0.31588668
#> 3       1         400    x 1.421812 1.505607 1.589403 0.16759077
#> 4       2         400    x 1.499239 1.580337 1.661436 0.16219628
#> 5       1        1600    x 1.424356 1.466122 1.507888 0.08353176
#> 6       2        1600    x 1.436348 1.477919 1.519491 0.08314316
#> 7       1        6400    x 1.474246 1.494887 1.515529 0.04128310
#> 8       2        6400    x 1.487190 1.508327 1.529465 0.04227480
#> 9       1       25600    x 1.495289 1.505822 1.516355 0.02106636
#> 10      2       25600    x 1.498624 1.509280 1.519937 0.02131267

I’ve used only two trials to show the output of trials(), but increase it to, say, times = 100 and finish off the wrangling with the {dplyr} function summarize(mean(width), .by = sample_size).
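Assembled, that pipeline reads as follows (a sketch; it should yield a table like the one shown below):

sim_08 |>
  take_sample(n = sample_size) |>
  model_train(y ~ x) |>
  conf_interval() |>
  trials(times = 100, sample_size = c(100, 400, 1600, 6400, 25600)) |>
  filter(term == "x") |>
  mutate(width = .upr - .lwr) |>
  summarize(mean(width), .by = sample_size)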

#>   sample_size mean(width)
#> 1         100  0.34800368
#> 2         400  0.17059320
#> 3        1600  0.08483481
#> 4        6400  0.04251015
#> 5       25600  0.02123563
