
skimr

skimr provides a frictionless approach to summary statistics which conforms to the principle of least surprise, displaying summary statistics the user can skim quickly to understand their data. It handles different data types and returns a skim_df object which can be included in a pipeline or displayed nicely for the human reader.

Note: skimr version 2 has major changes when skimr is used programmatically. Upgraders should review this document, the release notes and vignettes carefully.

Installation

The current released version of skimr can be installed from CRAN. If you wish to install the current build of the next release you can do so using the following:

# install.packages("devtools")
devtools::install_github("ropensci/skimr")

The APIs for this branch should be considered reasonably stable but still subject to change if an issue is discovered.

To install the version with the most recent changes that have not yet been incorporated in the main branch (and may not be):

devtools::install_github("ropensci/skimr", ref = "develop")

Do not rely on APIs from the develop branch, as they are likely to change.
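For the released version, a minimal sketch of installing from CRAN and loading the package might look like the following. The requireNamespace() guard is an assumption added here to avoid reinstalling an existing copy; it is not part of the original instructions.

```r
# Install the released version from CRAN only if it is not already available.
# (The guard is an illustrative convenience, not a skimr requirement.)
if (!requireNamespace("skimr", quietly = TRUE)) {
  install.packages("skimr")
}

# Attach the package so that skim() and friends are available.
library(skimr)
```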

Skim statistics in the console

skimr:

Provides a larger set of statistics than summary(), including missing, complete, n, and sd.
Reports each data type separately.
Handles dates, logicals, and a variety of other types.
Supports spark-bar and spark-line based on the pillar package.

Separates variables by class:

skim(chickwts)
## ── Data Summary ────────────────────────
##                            Values
## Name                       chickwts
## Number of rows             71
## Number of columns          2
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 feed                  0             1 FALSE          6 soy: 14, cas: 12, lin: 12, sun: 12
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd  p0  p25 p50  p75 p100 hist
## 1 weight                0             1 261. 78.1 108 204. 258 324.  423 ▆▆▇▇▃

Presentation is in a compact horizontal format:

skim(iris)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  4
## ________________________
## Group variables            None
##
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Sepal.Width           0             1 3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
## 3 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
## 4 Petal.Width           0             1 1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃

Built-in support for strings, lists and other column classes

skim(dplyr::starwars)
## ── Data Summary ────────────────────────
##                            Values
## Name                       dplyr::starwars
## Number of rows             87
## Number of columns          14
## _______________________
## Column type frequency:
##   character                8
##   list                     3
##   numeric                  3
## ________________________
## Group variables            None
##
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min max empty n_unique whitespace
## 1 name                  0         1       3  21     0       87          0
## 2 hair_color            5         0.943   4  13     0       11          0
## 3 skin_color            0         1       3  19     0       31          0
## 4 eye_color             0         1       3  13     0       15          0
## 5 sex                   4         0.954   4  14     0        4          0
## 6 gender                4         0.954   8   9     0        2          0
## 7 homeworld            10         0.885   4  14     0       48          0
## 8 species               4         0.954   3  14     0       37          0
##
## ── Variable type: list ─────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       16          0          5
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd p0   p25 p50   p75 p100 hist
## 1 height                6         0.931 175.   34.8 66 167   180 191    264 ▂▁▇▅▁
## 2 mass                 28         0.678  97.3 169.  15  55.6  79  84.5 1358 ▇▁▁▁▁
## 3 birth_year           44         0.494  87.6 155.   8  35    52  72    896 ▇▁▁▁▁

Has a useful summary function

skim(iris) |>
  summary()
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   factor                   1
##   numeric                  4
## ________________________
## Group variables            None

Individual columns can be selected using tidyverse-style selectors

skim(iris, Sepal.Length, Petal.Length)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  2
## ________________________
## Group variables            None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean    sd  p0 p25  p50 p75 p100 hist
## 1 Sepal.Length          0             1 5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
## 2 Petal.Length          0             1 3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂

Handles grouped data

skim() can handle data that has been grouped using dplyr::group_by().

iris |>
  dplyr::group_by(Species) |>
  skim()
## ── Data Summary ────────────────────────
##                            Values
## Name                       dplyr::group_by(iris, Spe...
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  4
## ________________________
## Group variables            Species
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable Species    n_missing complete_rate  mean    sd  p0  p25  p50  p75 p100 hist
##  1 Sepal.Length  setosa             0             1 5.01  0.352 4.3 4.8  5    5.2   5.8 ▃▃▇▅▁
##  2 Sepal.Length  versicolor         0             1 5.94  0.516 4.9 5.6  5.9  6.3   7   ▂▇▆▃▃
##  3 Sepal.Length  virginica          0             1 6.59  0.636 4.9 6.22 6.5  6.9   7.9 ▁▃▇▃▂
##  4 Sepal.Width   setosa             0             1 3.43  0.379 2.3 3.2  3.4  3.68  4.4 ▁▃▇▅▂
##  5 Sepal.Width   versicolor         0             1 2.77  0.314 2   2.52 2.8  3     3.4 ▁▅▆▇▂
##  6 Sepal.Width   virginica          0             1 2.97  0.322 2.2 2.8  3    3.18  3.8 ▂▆▇▅▁
##  7 Petal.Length  setosa             0             1 1.46  0.174 1   1.4  1.5  1.58  1.9 ▁▃▇▃▁
##  8 Petal.Length  versicolor         0             1 4.26  0.470 3   4    4.35 4.6   5.1 ▂▂▇▇▆
##  9 Petal.Length  virginica          0             1 5.55  0.552 4.5 5.1  5.55 5.88  6.9 ▃▇▇▃▂
## 10 Petal.Width   setosa             0             1 0.246 0.105 0.1 0.2  0.2  0.3   0.6 ▇▂▂▁▁
## 11 Petal.Width   versicolor         0             1 1.33  0.198 1   1.2  1.3  1.5   1.8 ▅▇▃▆▁
## 12 Petal.Width   virginica          0             1 2.03  0.275 1.4 1.8  2    2.3   2.5 ▂▇▆▅▇

Behaves nicely in pipelines

iris |>
  skim() |>
  dplyr::filter(numeric.sd > 1)
## ── Data Summary ────────────────────────
##                            Values
## Name                       iris
## Number of rows             150
## Number of columns          5
## _______________________
## Column type frequency:
##   numeric                  1
## ________________________
## Group variables            None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate mean   sd p0 p25  p50 p75 p100 hist
## 1 Petal.Length          0             1 3.76 1.77  1 1.6 4.35 5.1  6.9 ▇▁▆▇▂
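Because the result of skim() is a data frame, it can also be taken apart for downstream steps. The sketch below assumes skimr version 2, where yank() and partition() are exported helpers; the variable names are illustrative only.

```r
library(skimr)

skimmed <- skim(iris)

# yank() extracts the summary for a single column type as its own table.
numeric_summary <- yank(skimmed, "numeric")

# partition() splits the skim_df into a list with one table per column type.
by_type <- partition(skimmed)

# Inspect which column types were found.
names(by_type)
```

Either result can then be passed to ordinary dplyr verbs, in the same way as the filter() call above.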

Knitted results

Simply skimming a data frame will produce the horizontal print layout shown above. We provide a knit_print method for the types of objects in this package so that similar results are produced in documents. To use this, make sure the skimmed object is the last item in your code chunk.

faithful |>
  skim()

Data summary
Name	faithful
Number of rows	272
Number of columns	2
_______________________
Column type frequency:
numeric	2
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
eruptions	0	1	3.49	1.14	1.6	2.16	4	4.45	5.1	▇▂▂▇▇
waiting	0	1	70.90	13.59	43.0	58.00	76	82.00	96.0	▃▃▂▇▂

Customizing skimr

Although skimr provides opinionated defaults, it is highly customizable. Users can specify their own statistics, change the formatting of results, create statistics for new classes and develop skimmers for data structures that are not data frames.

Specify your own statistics and classes

Users can specify their own statistics using a list combined with the skim_with() function factory. skim_with() returns a new skim function that can be called on your data. You can use this factory to produce summaries for any type of column within your data.

Assignment within a call to skim_with() relies on a helper function, sfl or skimr function list. By default, functions in the sfl call are appended to the default skimmers, and names are automatically generated as well.

my_skim <- skim_with(numeric = sfl(mad))
my_skim(iris, Sepal.Length)

But you can also use helpers from the tidyverse to create new anonymous functions that set particular function arguments. The behavior is the same as in purrr or dplyr, with both . and .x as acceptable pronouns. Setting the append = FALSE argument uses only those functions that you’ve provided.

my_skim <- skim_with(
  numeric = sfl(
    iqr = IQR,
    p01 = ~ quantile(.x, probs = .01),
    p99 = ~ quantile(., probs = .99)
  ),
  append = FALSE
)
my_skim(iris, Sepal.Length)

And you can remove default skimmers by setting them to NULL.

my_skim <- skim_with(numeric = sfl(hist = NULL))
my_skim(iris, Sepal.Length)
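The same factory accepts several column types in one call. The following is a minimal sketch that drops the histogram for numeric columns and appends a statistic for factor columns; the name n_levels is a hypothetical label chosen for this example and is not a skimr default.

```r
library(skimr)

# Customize two column types at once:
# - numeric: remove the default inline histogram
# - factor: append a count of factor levels (illustrative name `n_levels`)
my_skim <- skim_with(
  numeric = sfl(hist = NULL),
  factor = sfl(n_levels = ~ nlevels(.x))
)

result <- my_skim(iris)

# Column names are prefixed with the column type they belong to.
names(result)
```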

Skimming other objects

skimr has summary functions for the following types of data by default:

numeric (which includes both double and integer)
character
factor
logical
complex
Date
POSIXct
ts
AsIs
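To see which statistics are attached to one of these types, the defaults can be inspected from the console. This sketch assumes that get_default_skimmer_names() is exported by the installed skimr version and accepts a type name; check the package reference if it is not found.

```r
library(skimr)

# List the names of the default summary functions for numeric columns.
numeric_defaults <- get_default_skimmer_names("numeric")
numeric_defaults

# Calling it without arguments is expected to list the defaults for every type.
all_defaults <- get_default_skimmer_names()
names(all_defaults)
```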

skimr also provides a small API for writing packages that provide their own default summary functions for data types not covered above. It relies on R S3 methods for the get_skimmers function. This function should return an sfl, similar to customization within skim_with(), but you should also provide a value for the skim_type argument. Here’s an example.

get_skimmers.my_data_type <- function(column) {
  sfl(
    skim_type = "my_data_type",
    p99 = ~ quantile(., probs = .99)
  )
}

Limitations of current version

We are aware that there are issues with rendering the inline histograms and line charts in various contexts, some of which are described below.

Support for spark histograms

With versions of R before 4.2.1, there are known issues with printing the spark-histogram characters when printing a data frame. For example, "▂▅▇" is printed as "<U+2582><U+2585><U+2587>". This longstanding problem originates in the low-level code for printing data frames. While some cases have been addressed, there are, for example, reports of this issue in Emacs ESS. While this is a deep issue, there is ongoing work to address it in base R. We recommend upgrading to at least R 4.2.1 to address this issue.

This means that while skimr can render the histograms to the console and in RMarkdown documents, it cannot in other circumstances. This includes:

converting a skimr data frame to a vanilla R data frame, but tibbles render correctly
in the context of rendering to a pdf using an engine that does not support utf-8.

One workaround for showing these characters in Windows is to set the CTYPE part of your locale to Chinese/Japanese/Korean with Sys.setlocale("LC_CTYPE", "Chinese"). The helper function fix_windows_histograms() does this for you.

And last but not least, we provide skim_without_charts() as a fallback. This makes it easy to still get summaries of your data, even if unicode issues continue.
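As a minimal sketch of that fallback, skim_without_charts() can be called in place of skim(). The variable name plain is illustrative.

```r
library(skimr)

# Same summaries as skim(), but without the inline histogram column,
# so no block characters need to be rendered.
plain <- skim_without_charts(iris)
plain
```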

Printing spark histograms and line graphs in knitted documents

Spark-bar and spark-line work in the console, but may not work when you knit them to a specific document format. The same session that produces a correctly rendered HTML document may produce an incorrectly rendered PDF, for example. This issue can generally be addressed by changing fonts to one with good building block (for histograms) and Braille support (for line graphs). For example, the open font “DejaVu Sans” from the extrafont package supports these. You may also want to try wrapping your results in knitr::kable(). Please see the vignette on using fonts for details.

Displays in documents of different types will vary. For example, one user found that the font “Yu Gothic UI Semilight” produced consistent results for Microsoft Word and LibreOffice Writer.
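The knitr::kable() suggestion can be sketched as follows, assuming the knitr package is installed and the code runs inside a knitted chunk. Whether the histogram characters render correctly still depends on the output format and font.

```r
library(skimr)

# Wrap the skim result in kable() so that knitr formats it as a plain table
# instead of relying on skimr's own knit_print method.
tbl <- skim(iris) |>
  knitr::kable()

tbl
```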

Inspirations

TextPlots for use of Braille characters
spark for use of block characters.

The earliest use of unicode characters to generate sparklines appears to be from 2009.

Exercising these ideas to their fullest requires a font with good support for block drawing characters. PragmataPro is one such font.

Contributing

We welcome issue reports and pull requests, including potentially adding support for commonly used variable classes. However, in general, we encourage users to take advantage of skimr’s flexibility to add their own customized classes. Please see the contributing and conduct documents.
