Historical statistics of every R package ever


ropensci-review-tools/pkgstats


R build status | codecov | Project Status: Active

pkgstats

Extract summary statistics of R package structure and functionality. Also includes a function to extract statistics of all R packages from a local CRAN mirror. Not all statistics of course, but a good go at balancing insightful statistics while ensuring computational feasibility.

What statistics?

Statistics are derived from these primary sources:

  1. Numbers of lines of code, documentation, and white space (both between and within lines) in each directory and language
  2. Summaries of package DESCRIPTION file and related package meta-statistics
  3. Summaries of all objects created via package code across multiple languages and all directories containing source code (./R, ./src, and ./inst/include).
  4. A function call network derived from function definitions obtained from ctags, and references (“calls”) to those obtained from gtags. This network roughly connects every object making a call (as from) with every object being called (to).
  5. An additional function call network connecting calls within R functions to all functions from other R packages.

The primary function, pkgstats(), returns a list of these various components, including full data.frame objects for the final three components described above. The statistical properties of this list can be aggregated by the pkgstats_summary() function, which returns a data.frame with a single row of summary statistics. See below for further details.

Installation

The easiest way to install this package is via the associated r-universe. As shown there, simply enable the universe with

options (repos = c (
    ropenscireviewtools = "https://ropensci-review-tools.r-universe.dev",
    CRAN = "https://cloud.r-project.org"
))

And then install the usual way with,

install.packages("pkgstats")

Alternatively, the package can be installed by running one of the following lines:

remotes::install_github ("ropensci-review-tools/pkgstats")
pak::pkg_install ("ropensci-review-tools/pkgstats")

The package can then be loaded for use with

library (pkgstats)

Installation on Linux systems

This package requires the system libraries ctags-universal and GNU global, both of which are automatically installed along with the package on both Windows and MacOS systems. Most Linux distributions do not include a sufficiently up-to-date version of ctags-universal, and so it must be compiled from source. This can be done by running a single function, ctags_install(), which will install both ctags-universal and GNU global.

The pkgstats package includes a function to ensure your local installations of universal-ctags and global work correctly. Please ensure you see the following prior to proceeding:

ctags_test ()
## ctags installation works as expected
## [1] TRUE

Note that GNU global can be linked at installation to the Universal Ctags plug-in parser to expand the default 5 languages to 30. This makes no difference to pkgstats results, as gtags output is only used to trace function call networks, which is only possible for compiled languages able to dynamically share pointers to the same objects. This is possible with the default parser regardless. The wealth of extra information obtained from linking global to the Universal Ctags parser is ultimately discarded anyway, yet parsing may take considerably longer. If this is the case, “default” behaviour may be recovered by first running the following command:

Sys.unsetenv (c ("GTAGSCONF", "GTAGSLABEL"))

See information on how to install the plugin for more details.

Demonstration

The following code demonstrates the output of the main function, pkgstats, applied to the relatively simple magrittr package. The system.time call also shows that these statistics are extracted quite quickly.

tarball <- "magrittr_2.0.1.tar.gz"
u <- paste0 ("https://cran.r-project.org/src/contrib/", tarball)
f <- file.path (tempdir (), tarball)
download.file (u, f)
system.time (
    p <- pkgstats (f)
)
##    user  system elapsed
##   0.922   0.141   1.961
names (p)
## [1] "loc"            "vignettes"      "data_stats"     "desc"
## [5] "translations"   "objects"        "network"        "external_calls"

The result is a list of various data extracted from the code. All except for objects, network, and external_calls represent summary data:

p [!names (p) %in% c ("objects", "network", "external_calls")]
## $loc
## # A tibble: 3 × 12
## # Groups:   language, dir [3]
##   language dir   nfiles nlines ncode  ndoc nempty nspaces nchars nexpr ntabs
##   <chr>    <chr>  <int>  <int> <int> <int>  <int>   <int>  <int> <dbl> <int>
## 1 C        src        2    590   447    22    121    1136  10826     1     0
## 2 R        R          7    699   163   484     52    2835  15645     1     1
## 3 R        tests     10    374   259    13    102     867   8527     2     4
## # … with 1 more variable: indentation <int>
##
## $vignettes
## vignettes     demos
##         2         0
##
## $data_stats
##           n  total_size median_size
##           0           0           0
##
## $desc
##    package version                date            license
## 1 magrittr   2.0.1 2020-11-17 16:20:06 MIT + file LICENSE
##                                                                     urls
## 1 https://magrittr.tidyverse.org,\nhttps://github.com/tidyverse/magrittr
##                                           bugs aut ctb fnd rev ths trl depends
## 1 https://github.com/tidyverse/magrittr/issues   2   0   1   0   0   0      NA
##   imports                                suggests linking_to
## 1      NA covr, knitr, rlang, rmarkdown, testthat         NA
##
## $translations
## [1] NA

The first item, loc, contains the following Lines-Of-Code and related statistics, separated into distinct combinations of computer language and directory:

  1. nfiles = Numbers of files in each directory and language.
  2. nlines = Total numbers of lines in all files.
  3. ncode = Total numbers of lines of code.
  4. ndoc = Total numbers of documentation or comment lines.
  5. nempty = Total numbers of empty or blank lines.
  6. nspaces = Total numbers of white spaces in all code lines, excluding leading indentation spaces.
  7. nchars = Total numbers of non-white-space characters in all code lines.
  8. nexpr = Median numbers of nested expressions in all lines which have any expressions (see below).
  9. ntabs = Number of lines of code with initial tab indentation.
  10. indentation = Number of spaces by which code is indented (with -1 denoting tab-indentation).

Numbers of nested expressions are counted as numbers of brackets of any type nested on a single line. The following line has one nested bracket:

x <- myfn ()

while the following has four:

x <- function () { return (myfn ()) }
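
The counting rule just described can be sketched in a few lines of base R. This is only an illustration of the rule as stated, counting opening brackets of any type on a single line. The helper count_brackets is hypothetical and is not part of pkgstats, whose own computation of nexpr may differ.

```r
# Hypothetical helper (not part of pkgstats): count the opening brackets
# of any type, "(", "[", or "{", appearing on one line of code.
count_brackets <- function (line) {
    chars <- strsplit (line, "") [[1]]
    sum (chars %in% c ("(", "[", "{"))
}

count_brackets ("x <- myfn ()") # 1
count_brackets ("x <- function () { return (myfn ()) }") # 4
```

The second line yields four because function (), the opening brace, return (), and myfn () each contribute one bracket.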

Code with fewer nested expressions per line is generally easier to read, and this metric is provided as one indication of the general readability of code. A second relative indication may be extracted by converting numbers of spaces and characters to a measure of relative numbers of white spaces, noting that the nchars value quantifies total characters including white spaces.

index <- which (p$loc$dir %in% c ("R", "src")) # consider source code only
sum (p$loc$nspaces [index]) / sum (p$loc$nchars [index])
## [1] 0.1500132

Finally, the ntabs statistic can be used to identify whether code uses tab characters as indentation, otherwise the indentation statistics indicate median numbers of white spaces by which code is indented. The objects, network, and external_calls items returned by the pkgstats() function are described further below.

The pkgstats_summary() function

A summary of the pkgstats data can be obtained by submitting the object returned from pkgstats() to the pkgstats_summary() function:

s <- pkgstats_summary (p)

This function reduces the result of the pkgstats() function to a single line with 91 entries, represented as a data.frame with one row and that number of columns. This format is intended to enable summary statistics from multiple packages to be aggregated by simply binding rows together. While 91 statistics might seem like overkill, the pkgstats_summary() function aims to return as many usable raw statistics as possible in order to flexibly allow higher-level statistics to be derived through combination and aggregation. These 91 statistics can be roughly grouped into the following categories (not shown in the order in which they actually appear), with variable names in parentheses after each description.
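
As a rough sketch of the row-binding aggregation mentioned above, the two functions already introduced can be applied to several packages and the one-row results stacked. The helper summarise_packages and its paths argument (a character vector of local source tarballs) are assumptions made for illustration, and are not part of pkgstats.

```r
# Hypothetical helper (not part of pkgstats): `paths` is assumed to be a
# character vector of paths to local package source tarballs.
summarise_packages <- function (paths) {
    # One single-row data.frame of summary statistics per package
    rows <- lapply (paths, function (f) {
        pkgstats::pkgstats_summary (pkgstats::pkgstats (f))
    })
    # Stack the rows into one data.frame with one row per package
    do.call (rbind, rows)
}
```

Calling summarise_packages (f) with the magrittr tarball downloaded above should give the same single row as s. With several tarballs, the result would have one row per package.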

Package Summaries

  • name (package)
  • Package version (version)
  • Package date, as modification time of DESCRIPTION file where not explicitly stated (date)
  • License (license)
  • Languages, as a single comma-separated character value (languages), and excluding R itself.
  • List of translations where package includes translation files, given as list of (spoken) language codes (translations).

Information from DESCRIPTION file

  • Package URL(s) (url)
  • URL for BugReports (bugs)
  • Number of contributors with role of author (desc_n_aut), contributor (desc_n_ctb), funder (desc_n_fnd), reviewer (desc_n_rev), thesis advisor (ths), and translator (trl, relating to translation between computer and not spoken languages).
  • Comma-separated character entries for all depends, imports, suggests, and linking_to packages.

Numbers of entries in each of the last two kinds of items can be obtained by a simple strsplit call, like this:

length (strsplit (s$suggests, ",") [[1]])
## [1] 5
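
One caveat with this approach is that strsplit applied to a missing character value returns a single NA, so a field with no entries would be counted as having one. The sketch below guards against that, assuming that empty fields appear as NA in the summary, as they do in the desc output shown earlier. The helper count_entries is hypothetical and not part of pkgstats.

```r
# Hypothetical helper (not part of pkgstats): count comma-separated
# entries, treating a missing (NA) or empty field as zero entries.
count_entries <- function (x) {
    if (length (x) == 0L || is.na (x) || !nzchar (x)) {
        return (0L)
    }
    length (strsplit (x, ",") [[1]])
}

count_entries ("covr, knitr, rlang, rmarkdown, testthat") # 5
count_entries (NA_character_) # 0
```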

Numbers of files and associated data

  • Number of vignettes (num_vignettes)
  • Number of demos (num_demos)
  • Number of data files (num_data_files)
  • Total size of all package data (data_size_total)
  • Median size of package data files (data_size_median)
  • Numbers of files in main sub-directories (files_R, files_src, files_inst, files_vignettes, files_tests), where numbers are recursively counted in all sub-directories, and where inst only counts files in the inst/include sub-directory.

Statistics on lines of code

  • Total lines of code in each sub-directory (loc_R, loc_src, loc_inst, loc_vignettes, loc_tests).
  • Total numbers of blank lines in each sub-directory (blank_lines_R, blank_lines_src, blank_lines_inst, blank_lines_vignettes, blank_lines_tests).
  • Total numbers of comment lines in each sub-directory (comment_lines_R, comment_lines_src, comment_lines_inst, comment_lines_vignettes, comment_lines_tests).
  • Measures of relative white space in each sub-directory (rel_space_R, rel_space_src, rel_space_inst, rel_space_vignettes, rel_space_tests), as well as an overall measure for the R/, src/, and inst/ directories (rel_space).
  • The number of spaces used to indent code (indentation), with values of -1 indicating indentation with tab characters.
  • The median number of nested expressions per line of code, counting only those lines which have any expressions (nexpr).

Statistics on individual objects (including functions)

These statistics all refer to “functions”, but actually represent more general “objects,” such as global variables or class definitions (generally from languages other than R), as detailed below.

  • Numbers of functions in R (n_fns_r)
  • Numbers of exported and non-exported R functions (n_fns_r_exported, n_fns_r_not_exported)
  • Number of functions (or objects) in other computer languages (n_fns_src), including functions in both src and inst/include directories.
  • Number of functions (or objects) per individual file in R and in all other (src) directories (n_fns_per_file_r, n_fns_per_file_src).
  • Mean and median numbers of parameters per exported R function (npars_exported_mn, npars_exported_md).
  • Mean and median lines of code per function in R and other languages, including distinction between exported and non-exported R functions (loc_per_fn_r_mn, loc_per_fn_r_md, loc_per_fn_r_exp_mn, loc_per_fn_r_exp_md, loc_per_fn_r_not_exp_mn, loc_per_fn_r_not_exp_md, loc_per_fn_src_mn, loc_per_fn_src_md).
  • Equivalent mean and median numbers of documentation lines per function (doclines_per_fn_exp_mn, doclines_per_fn_exp_md, doclines_per_fn_not_exp_mn, doclines_per_fn_not_exp_md, docchars_per_par_exp_mn, docchars_per_par_exp_md).

Network Statistics

The full structure of the network table is described below, with summary statistics including:

  • Number of edges, including distinction between languages (n_edges, n_edges_r, n_edges_src).
  • Number of distinct clusters in package network (n_clusters).
  • Mean and median centrality of all network edges, calculated from both directed and undirected representations of network (centrality_dir_mn, centrality_dir_md, centrality_undir_mn, centrality_undir_md).
  • Equivalent centrality values excluding edges with centrality of zero (centrality_dir_mn_no0, centrality_dir_md_no0, centrality_undir_mn_no0, centrality_undir_md_no0).
  • Numbers of terminal edges (num_terminal_edges_dir, num_terminal_edges_undir).
  • Summary statistics on node degree (node_degree_mn, node_degree_md, node_degree_max)

External Call Statistics

The final column in the result of the pkgstats_summary() function summarises the external_calls object detailing all calls made to external packages (including to base and recommended packages). This summary is represented as a single character string:

s$external_calls
## [1] "base:22:12,magrittr:16:11"

This is structured to allow numbers of calls to all packages to be readily extracted with code like the following:

calls <- do.call (
    rbind,
    strsplit (strsplit (s$external_calls, ",") [[1]], ":")
)
calls <- data.frame (
    package = calls [, 1],
    n_total = as.integer (calls [, 2]),
    n_unique = as.integer (calls [, 3])
)
print (calls)
##    package n_total n_unique
## 1     base      22       12
## 2 magrittr      16       11

The two numeric columns respectively show the total number of calls made to each package, and the total number of unique functions used within those packages. While this result is relatively uninformative for the magrittr package, which imports no other packages and relies only on base R functions, these results will generally provide detailed information on numbers of calls made and functions used.

The following sub-sections provide further detail on the objects, network, and external_calls items, which could be used to extract additional statistics beyond those described here.

Objects

The objects item contains all code objects identified by the code-tagging library ctags. For R, those are primarily functions, but for other languages may be a variety of entities such as class or structure definitions, or sub-members thereof. Object tables look like this:

head (p$objects)
##     file_name     fn_name     kind language loc npars has_dots exported
## 1 R/aliases.R     extract function        R   1    NA       NA     TRUE
## 2 R/aliases.R    extract2 function        R   1    NA       NA     TRUE
## 3 R/aliases.R  use_series function        R   1    NA       NA     TRUE
## 4 R/aliases.R         add function        R   1    NA       NA     TRUE
## 5 R/aliases.R    subtract function        R   1    NA       NA     TRUE
## 6 R/aliases.R multiply_by function        R   1    NA       NA     TRUE
##   param_nchars_md param_nchars_mn num_doclines
## 1              NA              NA           54
## 2              NA              NA           54
## 3              NA              NA           54
## 4              NA              NA           54
## 5              NA              NA           54
## 6              NA              NA           54

The magrittr package has a total of 191 objects, which the following lines provide some insight into.

table (p$objects$language)
##
##   C   R
##  64 127
table (p$objects$kind)
##
##        enum    function functionVar   globalVar        list       macro
##           1          92          27          30           1           3
##      member      struct    variable
##           4           2          31
table (p$objects$kind [p$objects$language == "R"])
##
##    function functionVar   globalVar        list
##          69          27          30           1
table (p$objects$kind [p$objects$language == "C"])
##
##     enum function    macro   member   struct variable
##        1       23        3        4        2       31
table (p$objects$kind [p$objects$language == "C++"])
## < table of extent 0 >

Network

The network item details all relationships between objects, which generally reflects one object calling or otherwise depending on another object. Each row thus represents one edge of a “function call” network, with each entry in the from and to columns representing the network vertices or nodes.

head (p$network)
##             file line1       from        to language cluster_dir centrality_dir
## 1       R/pipe.R   297 new_lambda   freduce        R           1              1
## 2    R/getters.R    14  `[[.fseq` functions        R           2              0
## 3    R/getters.R    23   `[.fseq` functions        R           2              0
## 4  R/functions.R    26 print.fseq functions        R           2              0
## 5 R/debug_pipe.R    28 debug_fseq functions        R           2              0
## 6 R/debug_pipe.R    35 debug_fseq functions        R           2              0
##   cluster_undir centrality_undir
## 1             1               17
## 2             2                0
## 3             2                0
## 4             2                0
## 5             2                0
## 6             2                0
nrow (p$network)
## [1] 39

The network table includes additional statistics on the centrality of each edge, measured as betweenness centrality assuming edges to be both directed (centrality_dir) and undirected (centrality_undir). More central edges reflect connections between objects that are more central to package functionality, and vice versa. The distinct components of the network are also represented by discrete cluster numbers, calculated both for directed and undirected versions of the network. Each distinct cluster number represents a distinct group of objects, internally related to other members of the same cluster, yet independent of all objects with different cluster numbers.
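
To make the idea of cluster numbers concrete, the following base-R sketch labels the connected components of an undirected edge list, using the first rows of the table above as input. It only illustrates what a cluster number means. The helper cluster_undirected is hypothetical, is not part of pkgstats, and is not claimed to be how the package itself derives cluster_undir.

```r
# First rows of the network table shown above, reduced to from/to columns
edges <- data.frame (
    from = c ("new_lambda", "`[[.fseq`", "`[.fseq`", "print.fseq", "debug_fseq"),
    to = c ("freduce", "functions", "functions", "functions", "functions"),
    stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pkgstats): label connected components
# of an undirected edge list by merging labels of joined nodes.
cluster_undirected <- function (edges) {
    nodes <- unique (c (edges$from, edges$to))
    cluster <- seq_along (nodes) # every node starts in its own cluster
    names (cluster) <- nodes
    for (i in seq_len (nrow (edges))) {
        a <- cluster [[edges$from [i]]]
        b <- cluster [[edges$to [i]]]
        if (a != b) {
            # merge the two clusters under the smaller label
            cluster [cluster == max (a, b)] <- min (a, b)
        }
    }
    # relabel clusters as consecutive integers, keeping node names
    cluster [] <- match (cluster, unique (cluster))
    cluster
}

cl <- cluster_undirected (edges)
length (unique (cl)) # 2
```

Here new_lambda and freduce form one cluster, while the four objects calling functions form a second, matching the two cluster numbers in the rows shown above. The numbers themselves are arbitrary labels.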

The network can be viewed as an interactive vis.js network through passing the result of pkgstats (here, p) to the plot_network() function.

External Calls

The external_calls item is structured similarly to the network object, but identifies all calls to functions from external packages. However, unlike the network and objects data, which provide information on objects and relationships in all computer languages used within a package, the external_calls object maps calls within R code only, in order to provide insight into the use within a package of functions from other packages, including R’s base and recommended packages. The object looks like this:

head (p$external_calls)
##   tags_line       call                  tag           file        kind start
## 1         1    .onLoad              .onLoad   R/magrittr.R    function    45
## 2         7     lapply     `_function_list`       R/pipe.R functionVar   294
## 3         7 as_pipe_fn     `_function_list`       R/pipe.R functionVar   294
## 4        11        cat anonFunc6fbaaec50100  R/functions.R    function    30
## 5        12  invisible anonFuncb07b5cc00100 R/debug_pipe.R    function    35
## 6        12      debug anonFuncb07b5cc00100 R/debug_pipe.R    function    35
##   end  package
## 1  47 magrittr
## 2 294     base
## 3 294 magrittr
## 4  30     base
## 5  35     base
## 6  35     base

These data are converted to a summary form by the pkgstats_summary() function, which tabulates numbers of external calls and unique functions from each package. These data are presented as a single character string which can be easily converted to the corresponding numeric values using code like the following:

x <- strsplit (s$external_calls, ",") [[1]]
x <- do.call (rbind, strsplit (x, ":"))
x <- data.frame (
    pkg = x [, 1],
    n_total = as.integer (x [, 2]),
    n_unique = as.integer (x [, 3])
)
x$n_total_rel <- round (x$n_total / sum (x$n_total), 3)
x$n_unique_rel <- round (x$n_unique / sum (x$n_unique), 3)
print (x)
##        pkg n_total n_unique n_total_rel n_unique_rel
## 1     base      22       12       0.579        0.522
## 2 magrittr      16       11       0.421        0.478

Those data reveal, for example, that the magrittr package makes 22 individual calls to 12 unique functions from the “base” package.

Code of Conduct

Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
