
`skimr` provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a `skim_df` object which can be included in a pipeline or displayed nicely for the human reader.

Note: `skimr` version 2 has major changes when `skimr` is used programmatically. Upgraders should review this document, the release notes, and the vignettes carefully.

The current released version of `skimr` can be installed from CRAN. If you wish to install the current build of the next release, you can do so using the following:
```r
# install.packages("devtools")
devtools::install_github("ropensci/skimr")
```

The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.

To install the version with the most recent changes that have not yet been incorporated in the main branch (and may not be):

```r
devtools::install_github("ropensci/skimr", ref = "develop")
```

Do not rely on APIs from the develop branch, as they are likely to change.
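For completeness, the CRAN release mentioned above can be installed with base R's `install.packages()`. The guard around the call is only a convenience added here so the snippet can be re-run safely; it is not something the package requires.

```r
# Install the released version from CRAN only if it is not already available.
if (!requireNamespace("skimr", quietly = TRUE)) {
  install.packages("skimr")
}

library(skimr)
```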
`skimr` provides a larger set of statistics than `summary()`, including missing, complete, n, and sd.

```r
skim(chickwts)
## ── Data Summary ────────────────────────
##                            Values
## Name                       chickwts
## Number of rows             71
## Number of columns          2
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0  p25 p50  p75 p100 hist
## 1 weight                0             1 261. 78.1 108 204. 258 324.  423 ▆▆▇▇▃
```

```r
skim(iris)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  4
## ________________________
## Group variables            None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃
```

```r
skim(dplyr::starwars)
## ── Data Summary ────────────────────────
##                            Values
## Name                       dplyr::starwars
## Number of rows             87
## Number of columns          14
## _______________________
## Column type frequency:
##   character                8
##   list                     3
##   numeric                  3
## ________________________
## Group variables            None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 name                  0         1       3  21     0       87          0
## 2 hair_color            5         0.943   4  13     0       11          0
## 3 skin_color            0         1       3  19     0       31          0
## 4 eye_color             0         1       3  13     0       15          0
## 5 sex                   4         0.954   4  14     0        4          0
## 6 gender                4         0.954   8   9     0        2          0
## 7 homeworld            10         0.885   4  14     0       48          0
## 8 species               4         0.954   3  14     0       37          0
##
## ── Variable type: list ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       16          0          5
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist
## 1 height                6         0.931 175.   34.8 66 167   180 191    264 ▂▁▇▅▁
## 2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
## 3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁
```

```r
skim(iris) |>
  summary()
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  4
## ________________________
## Group variables            None
```

```r
skim(iris, Sepal.Length, Petal.Length)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  2
## ________________________
## Group variables            None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
```

`skim()` can handle data that has been grouped using `dplyr::group_by()`.
```r
iris |>
  dplyr::group_by(Species) |>
  skim()
## ── Data Summary ────────────────────────
##                            Values
## Name                       dplyr::group_by(iris, Spe...
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  4
## ________________________
## Group variables            Species
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50  p75 p100 hist
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5    5.2   5.8 ▃▃▇▅▁
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9  6.3   7   ▂▇▆▃▃
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5  6.9   7.9 ▁▃▇▃▂
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4  3.68  4.4 ▁▃▇▅▂
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8  3     3.4 ▁▅▆▇▂
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3    3.18  3.8 ▂▆▇▅▁
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5  1.58  1.9 ▁▃▇▃▁
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35 4.6   5.1 ▂▂▇▇▆
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55 5.88  6.9 ▃▇▇▃▂
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2  0.3   0.6 ▇▂▂▁▁
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3  1.5   1.8 ▅▇▃▆▁
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2    2.3   2.5 ▂▇▆▅▇
```

```r
iris |>
  skim() |>
  dplyr::filter(numeric.sd > 1)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂
```

Simply skimming a data frame will produce the horizontal print layout shown above. We provide a `knit_print` method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the `skimmed` object is the last item in your code chunk.
```r
faithful |>
  skim()
```

Data summary

| | |
|---|---|
| Name | faithful |
| Number of rows | 272 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| eruptions | 0 | 1 | 3.49 | 1.14 | 1.6 | 2.16 | 4 | 4.45 | 5.1 | ▇▂▂▇▇ |
| waiting | 0 | 1 | 70.90 | 13.59 | 43.0 | 58.00 | 76 | 82.00 | 96.0 | ▃▃▂▇▂ |
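Because a `skim_df` behaves like a data frame, the values shown in these printed summaries can also be retrieved in code. The following is a minimal sketch, assuming the version 2 column naming in which type-specific statistics carry a type prefix (as in the `numeric.sd` filter earlier); `yank()` and `partition()` are skimr helpers for pulling out per-type results.

```r
library(skimr)

skimmed <- skim(iris)

# Type-specific statistics are stored in prefixed columns such as numeric.mean.
names(skimmed)

# yank() keeps only the rows and columns for one skim type and drops the prefix.
numeric_stats <- yank(skimmed, "numeric")
numeric_stats$mean

# partition() splits the skim_df into a named list with one element per type.
parts <- partition(skimmed)
names(parts)
```

Working from the unprefixed columns returned by `yank()` is usually more convenient when only one column type is of interest.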
Although `skimr` provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes, and develop skimmers for data structures that are not data frames.

Users can specify their own statistics using a list combined with the `skim_with()` function factory. `skim_with()` returns a new `skim` function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.

Assignment within a call to `skim_with()` relies on a helper function, `sfl` or `skimr` function list. By default, functions in the `sfl` call are appended to the default skimmers, and names are automatically generated as well.
```r
my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)
```

But you can also use helpers from the `tidyverse` to create new anonymous functions that set particular function arguments. The behavior is the same as in `purrr` or `dplyr`, with both `.` and `.x` as acceptable pronouns. Setting the `append = FALSE` argument uses only those functions that you've provided.
```r
my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,
    p01 = ~ quantile(.x, probs = .01),
    p99 = ~ quantile(., probs = .99)
  ),
  append = FALSE
)
my_skim(iris, Sepal.Length)
```

And you can remove default skimmers by setting them to `NULL`.
```r
my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
```

`skimr` has summary functions for the following types of data by default:

- `numeric` (which includes both `double` and `integer`)
- `character`
- `factor`
- `logical`
- `complex`
- `Date`
- `POSIXct`
- `ts`
- `AsIs`

`skimr` also provides a small API for writing packages that provide their own default summary functions for data types not covered above. It relies on R S3 methods for the `get_skimmers` function. This function should return an `sfl`, similar to customization within `skim_with()`, but you should also provide a value for the `skim_type` argument. Here's an example.

```r
get_skimmers.my_data_type <- function(column) {
  sfl(
    skim_type = "my_data_type",
    p99 = ~ quantile(., probs = .99)
  )
}
```

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.
With versions of R before 4.2.1, there are known issues with printing the spark-histogram characters when printing a data frame. For example, `"▂▅▇"` is printed as `"<U+2582><U+2585><U+2587>"`. This longstanding problem originates in the low-level code for printing data frames. While some cases have been addressed, there are, for example, reports of this issue in Emacs ESS. While this is a deep issue, there is ongoing work to address it in base R. We recommend upgrading to at least R 4.2.1 to address this issue.
This means that while `skimr` can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances. This includes:

- converting a `skimr` data frame to a vanilla R data frame, but tibbles render correctly

One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with `Sys.setlocale("LC_CTYPE", "Chinese")`. The helper function `fix_windows_histograms()` does this for you.
And last but not least, we provide `skim_without_charts()` as a fallback. This makes it easy to still get summaries of your data, even if unicode issues continue.
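As a rough sketch of how the fallback might be used, the snippet below calls `skim_without_charts()` directly and then wraps the choice in a small helper. The helper `safe_skim()` is hypothetical and not part of `skimr`; its rule for deciding when charts are safe (a UTF-8 locale and R 4.2.1 or later) is an assumption based on the issues described above, not an official recommendation.

```r
library(skimr)

# The fallback returns the same kind of summary, minus the chart columns.
no_charts <- skim_without_charts(iris)
names(no_charts)

# Hypothetical helper (not part of skimr): fall back to the chart-free
# skimmer when the session is unlikely to print the block characters.
safe_skim <- function(data, ...) {
  charts_ok <- isTRUE(l10n_info()[["UTF-8"]]) && getRversion() >= "4.2.1"
  if (charts_ok) {
    skim(data, ...)
  } else {
    skim_without_charts(data, ...)
  }
}

safe_skim(iris)
```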
Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by changing fonts to one with good building block (for histograms) and Braille support (for line graphs). For example, the open font “DejaVu Sans” from the `extrafont` package supports these. You may also want to try wrapping your results in `knitr::kable()`. Please see the vignette on using fonts for details.
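The `knitr::kable()` suggestion above can be sketched as follows. The `format = "pipe"` argument is a choice made here for illustration; whether the histogram characters display correctly in the final document still depends on the output format and font.

```r
library(skimr)

# Convert the skim_df to a Markdown pipe table; the histogram column is
# passed through as ordinary text.
skim_table <- skim(iris) |>
  knitr::kable(format = "pipe")

skim_table
```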
Displays in documents of different types will vary. For example, one user found that the font “Yu Gothic UI Semilight” produced consistent results for Microsoft Word and LibreOffice Writer.

The earliest use of unicode characters to generate sparklines appears to be from 2009.

Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.

We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr's flexibility to add their own customized classes. Please see the contributing and conduct documents.