packageRank: compute and visualize package download counts and percentile ranks

‘packageRank’ is an R package that helps put package download counts into context. It does so via two core functions, cranDownloads() and packageRank(), a set of filters that reduce download count inflation, and a host of other assorted functions.

You can read more about the package in the sections below:

  • I Download Counts describes how cranDownloads() gives cranlogs::cran_downloads() a more user-friendly interface and makes visualizing those data easy via its generic R plot() method.
  • II Download Percentile Ranks describes how packageRank() makes use of percentile ranks. This nonparametric statistic computes the percentage of packages with fewer downloads than yours: a package in the 74th percentile has more downloads than 74% of packages. This facilitates comparison and helps you locate your package in the overall distribution of CRAN package downloads.
  • III Inflation Filters describes four filter functions that remove software and behavioral artifacts that inflate nominal download counts. This functionality is available in packageRank() and packageLog().
  • IV Availability of Results discusses when results become available, how to use logInfo() to check the availability of results, and the effect of time zones.
  • V Reverse lookup of counts, ranks and percentiles discusses queryCount(), queryRank(), queryPercentile() and cranDistribution().
  • VI Data Fixes discusses two problems with download counts. The first involves issues with logs collected between the end of 2012 and the beginning of 2013. This is fixed in fixDate_2012() and fixCranlogs(). The second is an issue with ‘cranlogs’ that doubles or triples the number of R application download counts between 2023-09-13 and 2023-10-02. This is fixed in fixRCranlogs().
  • VII Data Note discusses the spike in downloads of the Windows version of the R application on Sundays and Wednesdays between 06 November 2022 and 19 March 2023.
  • VIII et cetera discusses country code top-level domains (e.g., countryPackage() and packageCountry()), the use of memoization and the internet connection timeout problem.

getting started

To install ‘packageRank’ from CRAN:

install.packages("packageRank")

To install the development version from GitHub:

# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)

I - download counts

cranDownloads() uses all the same arguments as cranlogs::cran_downloads():

cranlogs::cran_downloads(packages="HistData")
>         date count  package
> 1 2020-05-01   338 HistData

The only difference is that cranDownloads() adds four features:

i) “spell check” for package names

cranDownloads(packages="GGplot2")
## Error in cranDownloads(packages = "GGplot2") :
##   GGplot2: misspelled or not on CRAN.

cranDownloads(packages="ggplot2")
>         date count cumulative package
> 1 2020-05-01 56357      56357 ggplot2


Note that this also works for inactive or “retired” packages in the Archive:

cranDownloads(packages="vr")
## Error in cranDownloads(packages = "vr") :
##  vr: misspelled or not on CRAN/Archive.

cranDownloads(packages="VR")
>         date count cumulative package
> 1 2020-05-01    11         11      VR

ii) two additional date formats

With cranlogs::cran_downloads(), you specify a time frame using the from and to arguments. The downside of this is that you must use “yyyy-mm-dd”. For convenience’s sake, cranDownloads() also allows you to use “yyyy-mm” or yyyy (“yyyy” also works).

“yyyy-mm”

Let’s say you want the download counts for ‘HistData’ for February 2020. With cranlogs::cran_downloads(), you’d have to type out the whole date and remember that 2020 was a leap year:

cranlogs::cran_downloads(packages="HistData",from="2020-02-01",to="2020-02-29")


With cranDownloads(), you can just specify the year and month:

cranDownloads(packages="HistData",from="2020-02",to="2020-02")
yyyy or “yyyy”

Let’s say you want the download counts for ‘rstan’ for 2020. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages="rstan",from="2022-01-01",to="2022-12-31")


With cranDownloads(), you can use:

cranDownloads(packages="rstan",from=2020,to=2020)

or

cranDownloads(packages="rstan",from="2020",to="2020")

iii) shortcuts with from = and to = in cranDownloads()

These additional date formats help to create convenient shortcuts. Let’s say you want the year-to-date download counts for ‘rstan’. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages="rstan",from="2023-01-01",to= Sys.Date()-1)


With cranDownloads(), you can just pass the current year to from =:

cranDownloads(packages="rstan",from=2023)

And if you wanted the entire download history, pass the current year to to =:

cranDownloads(packages="rstan",to=2023)

Note that the Posit/RStudio logs begin on 01 October 2012.

iv) check date validity

cranDownloads(packages="HistData",from="2019-01-15",to="2019-01-35")
## Error in resolveDate(to, type = "to") : Not a valid date.

v) cumulative count for selected time frame

cranDownloads(packages="HistData",when="last-week")
>         date count cumulative  package
> 1 2020-05-01   338        338 HistData
> 2 2020-05-02   259        597 HistData
> 3 2020-05-03   321        918 HistData
> 4 2020-05-04   344       1262 HistData
> 5 2020-05-05   324       1586 HistData
> 6 2020-05-06   356       1942 HistData
> 7 2020-05-07   324       2266 HistData

pro.mode

The “spell check” or validation of packages described above requires some additional background downloads. While those data are cached via the ‘memoise’ package, they add time the first time cranDownloads() is run. For faster results, you can bypass those features by setting pro.mode = TRUE. The downside is that you’ll see zero downloads for packages on dates before they’re published on CRAN and zero downloads for misspelled/non-existent packages. You also won’t be able to use the to = argument by itself.

For example, ‘packageRank’ was first published on CRAN on 2019-05-16 - you can verify this via packageHistory("packageRank"). But if you use cranlogs::cran_downloads() or cranDownloads(pro.mode = TRUE), you’ll see zero downloads for dates before 2019-05-16:

cranDownloads("packageRank",from="2019-05-10",to="2019-05-16",pro.mode=TRUE)>datecountcumulativepackage>12019-05-1000packageRank>22019-05-1100packageRank>32019-05-1200packageRank>42019-05-1300packageRank>52019-05-1400packageRank>62019-05-1500packageRank>72019-05-166868packageRank

You’ll notice this particularly when one of the packages you’re including in cranDownloads() is a newer package.

If you misspell a package:

cranDownloads("vr",from="2019-05-10",to="2019-05-16",pro.mode=TRUE)>datecountcumulativepackage>12019-05-1000vr>22019-05-1100vr>32019-05-1200vr>42019-05-1300vr>52019-05-1400vr>62019-05-1500vr>72019-05-1600vr

If you just use to = without a value for from =, you’ll get an error:

cranDownloads(to=2024,pro.mode=TRUE)
Error: You must also provide a date for "from".

visualizing package download counts

cranDownloads() makes visualizing package downloads easy by using plot():

plot(cranDownloads(packages="HistData",from="2019",to="2019"))

If you pass a vector of package names for a single day, plot() returns a dotchart:

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020-03-01",to="2020-03-01"))

If you pass a vector of package names for multiple days, plot() uses ‘ggplot2’ facets:

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020",to="2020-03-20"))


To plot those data in a single frame, set multi.plot = TRUE:

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020",to="2020-03-20"),multi.plot=TRUE)


To plot those data in separate plots on the same scale, set graphics = "base" and you’ll be prompted for each plot:

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020",to="2020-03-20"),graphics="base")

To do the above on separate, independent scales, set same.xy = FALSE:

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020",to="2020-03-20"),graphics="base",same.xy=FALSE)

logarithm of download counts

To use the base 10 logarithm of the download count in a plot, set log.y = TRUE:

plot(cranDownloads(packages="HistData",from="2019",to="2019"),log.y=TRUE)

Note that for the sake of the plot, zero counts are replaced by ones so that the logarithm can be computed (this does not affect the data returned by cranDownloads()).
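For illustration, here is a minimal sketch of that adjustment (an assumption for exposition, not necessarily the package's exact internal code): zeros become ones before the base 10 logarithm is taken.

# Sketch only: map zero counts to one, then take the base 10 logarithm.
count <- c(0, 12, 340)
log10(pmax(count, 1))
> [1] 0.000000 1.079181 2.531479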

packages = NULL

cranlogs::cran_downloads(packages = NULL) computes the total number of package downloads from CRAN. You can plot these data by using:

plot(cranDownloads(from=2019,to=2019))

packages = "R"

cranlogs::cran_downloads(packages = "R") computes the total number of downloads of the R application (note that you can only use “R” or a vector of package names, not both!). You can plot these data by using:

plot(cranDownloads(packages="R",from=2019,to=2019))

If you want the total count of R downloads, set r.total = TRUE:

plot(cranDownloads(packages="R",from=2019,to=2019),r.total=TRUE)

Note that starting on Sunday, 06 November 2022 and Wednesday, 18 January 2023, there were spikes in downloads of the Windows version of R on Sundays and Wednesdays (details below in R Windows Sunday and Wednesday downloads).

smoothers and confidence intervals

To add a smoother to your plot, use smooth = TRUE:

plot(cranDownloads(packages="rstan",from="2019",to="2019"),smooth=TRUE)

With graphs that use ‘ggplot2’, se = TRUE will add a confidence interval:

plot(cranDownloads(packages= c("HistData","rnaturalearth","Zelig"),from="2020",to="2020-03-20"),smooth=TRUE,se=TRUE)

In general, loess is the chosen smoother. Note that with base graphics, lowess is used when there are 7 or fewer observations. Thus, to control the degree of smoothness, you’ll typically use the span argument (the default is span = 0.75). With base graphics and 7 or fewer observations, you control the degree of smoothness using the f argument (the default is f = 2/3):

plot(cranDownloads(packages= c("HistData","rnaturalearth","Zelig"),from="2020",to="2020-03-20"),smooth=TRUE,span=0.75)plot(cranDownloads(packages= c("HistData","rnaturalearth","Zelig"),from="2020",to="2020-03-20"),smooth=TRUE,graphics="ggplot2",span=0.33)

package and R release dates

To annotate a graph with a package’s release dates (base graphics only):

plot(cranDownloads(packages="rstan",from="2019",to="2019"),package.version=TRUE)

To annotate a graph with R release dates:

plot(cranDownloads(packages="rstan",from="2019",to="2019"),r.version=TRUE)

plot growth curves (cumulative download counts)

To plot growth curves, set statistic = "cumulative":

plot(cranDownloads(packages= c("ggplot2","data.table","Rcpp"),from="2020",to="2020-03-20"),statistic="cumulative",multi.plot=TRUE,points=FALSE)

population plot

To visualize a package’s downloads relative to “all” other packages over time:

plot(cranDownloads(packages="HistData",from="2020",to="2020-03-20"),population.plot=TRUE)

This longitudinal view plots the date (x-axis) against the base 10 logarithm of the selected package’s download counts (y-axis). To get a sense of how the selected package’s performance stacks up against “all” other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (this is the “typical” pattern of downloads on CRAN for the selected time period).1

unit of observation

The default unit of observation for both cranDownloads() and cranlogs::cran_downloads() is the day. The graph below plots the daily downloads for ‘cranlogs’ from 01 January 2022 through 15 April 2022.

plot(cranDownloads(packages="cranlogs",from=2022,to="2022-04-15"))

To view the data from a less granular perspective, change plot.cranDownloads()’s unit.observation argument from “day” to “week”, “month”, or “year”.

unit.observation = "month"

The graph below plots the data aggregated by month (with an added smoother):

plot(cranDownloads(packages="cranlogs",from=2022,to="2022-04-15"),unit.observation="month",smooth=TRUE,graphics="ggplot2")

Three things to note. First, if the last/current month (far right) is still in progress (it’s not yet the end of the month), that observation will be split in two: one point for the in-progress total (empty black square), another for the estimated total (empty red circle). The estimate is based on the proportion of the month completed. In the example above, the 635 observed downloads from April 1 through April 15 translate into an estimate of 1,270 downloads for the entire month (30 / 15 * 635). Second, if a smoother is included, it will only use “complete” observations, not in-progress or estimated data. Third, all points are plotted along the x-axis on the first day of the month.
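To make the proration explicit, here's a minimal sketch of the arithmetic described above (illustrative only; the plot method does this internally):

count.observed <- 635  # downloads observed 2022-04-01 through 2022-04-15
days.elapsed <- 15
days.in.month <- 30
days.in.month / days.elapsed * count.observed  # scale up by the fraction of the month completed
> [1] 1270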

unit.observation = "week"

The graph below plots the data aggregated by week (weeks begin on Sunday).

plot(cranDownloads(packages="cranlogs",from=2022,to="2022-06-15"),unit.observation="week",smooth=TRUE)

Four things to note. First, if the first week (far left) is incomplete (the ‘from’ date is not a Sunday), that observation will be split in two: one point for the observed total on the start date (gray empty square) and another point for the backdated total. Backdating involves completing the week by pushing the nominal start date back to include the previous Sunday (blue asterisk). In the example above, the nominal start date (01 January 2022) is moved back to include data through the previous Sunday (26 December 2021). This is useful because with a weekly unit of observation the first observation is likely to be truncated and would not give the most representative picture of the data. Second, if the last week (far right) is in progress (the ‘to’ date is not a Saturday), that observation will be split in two: the observed total (gray empty square) and the estimated total based on the proportion of the week completed (red empty circle). Third, just like the monthly plot, smoothers only use complete observations, including backdated data but excluding in-progress and estimated data. Fourth, with the exception of the first week’s observed count, which is plotted at its nominal date, points are plotted along the x-axis on Sundays, the first day of the week.

my default plots

For what it’s worth, below are my go-to commands for graphs. They take advantage of the RStudio IDE’s plot history panel, which allows you to cycle through and compare graphs. Typically, I’ll look at the data for the last year or so at the three available units of observation: day, week and month. I use base graphics, via graphics = "base", to take advantage of prompts and “nicer” axis annotation. This also allows me to easily add graphical elements afterwards as needed, e.g., abline(h = 100, lty = "dotted").

plot(cranDownloads(packages= c("cholera","packageRank"),from=2023),graphics="base",package.version=TRUE,smooth=TRUE,unit.observation="day")plot(cranDownloads(packages= c("cholera","packageRank"),from=2023),graphics="base",package.version=TRUE,smooth=TRUE,unit.observation="week")# Note that I disable smoothing for monthly dataplot(cranDownloads(packages= c("cholera","packageRank"),from=2023),graphics="base",package.version=TRUE,smooth=FALSE,unit.observation="month")

pro.mode

Perhaps the biggest downside of using cranDownloads(pro.mode = TRUE) is that you might draw mistaken inferences when plotting the data, since it adds false zeroes to your data.

Using the example of ‘packageRank’, which was published on 2019-05-16:

plot(cranDownloads("packageRank",from="2019-05",to="2019-05",pro.mode=TRUE),smooth=TRUE)

plot(cranDownloads("packageRank",from="2019-05",to="2019-05",pro.mode=FALSE),smooth=TRUE)

II - download percentile ranks

After spending some time with nominal download counts, the “compared to what?” question will come to mind. For instance, consider the data for the ‘cholera’ package from the first week of March 2020:

plot(cranDownloads(packages="cholera",from="2020-03-01",to="2020-03-07"))

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?

To answer these questions, we can start by looking at the total number of package downloads:

plot(cranDownloads(from="2020-03-01",to="2020-03-07"))

Here we see that there’s a big difference between the work week and the weekend. This seems to indicate that the download activity for ‘cholera’ on the weekend is unusually high. Moreover, the Wednesday peak for ‘cholera’ downloads seems higher than the mid-week peak of total downloads.

One way to better address these observations is to locate your package’s download counts in the overall frequency distribution of download counts. packageDistribution() allows you to do so. Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count’s frequency. The location of ‘cholera’ in the distribution is highlighted in red.

plot(packageDistribution(package="cholera",date="2020-03-04"))

plot(packageDistribution(package="cholera",date="2020-03-07"))

While these plots give us a better picture of where ‘cholera’ is located, comparisons between Wednesday and Saturday are still impressionistic: all we can confidently say is that the download counts for both days were greater than the mode.

To facilitate interpretation and comparison, I use the percentile rank of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, percentile ranks make it easier to compare packages within and across distributions.

For example, we can compare Wednesday (“2020-03-04”) to Saturday (“2020-03-07”):

packageRank(package = "cholera", date = "2020-03-04")
>         date package count            rank percentile
> 1 2020-03-04 cholera    38 5,788 of 18,038       67.9

On Wednesday, we can see that ‘cholera’ had 38 downloads, came in 5,788th place out of the 18,038 different packages downloaded, and earned a spot in the 68th percentile.

packageRank(package = "cholera", date = "2020-03-07")
>         date package count            rank percentile
> 1 2020-03-07 cholera    29 3,189 of 15,950         80

On Saturday, we can see that ‘cholera’ had 29 downloads, came in 3,189th place out of the 15,950 different packages downloaded, and earned a spot in the 80th percentile.

So contrary to what the nominal counts tell us, one could say that the interest in ‘cholera’ was actually greater on Saturday than on Wednesday.

computing percentile rank

To compute percentile ranks, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using ‘cholera’ from Wednesday as an example:

pkg.rank<- packageRank(packages="cholera",date="2020-03-04")downloads<-pkg.rank$cran.data$countnames(downloads)<-pkg.rank$cran.data$packageround(100* mean(downloads<downloads["cholera"]),1)> [1]67.9

To put it differently:

(pkgs.with.fewer.downloads<- sum(downloads<downloads["cholera"]))> [1]12250(tot.pkgs<- length(downloads))> [1]18038round(100*pkgs.with.fewer.downloads/tot.pkgs,1)> [1]67.9

competition v. nominal ranks

In the example above, 38 downloads puts ‘cholera’ in 5,788th place if we allow for ties using competition ranking (i.e., “1224” ranking) and 5,556th place if we don’t, using nominal/ordinal ranking (i.e., “1234” ranking).

Prior to v0.9.2.9008, only nominal/ordinal ranking was available. Competition ranking is now the default via packageRank(rank.ties = TRUE). If you want ordinal ranking, use packageRank(rank.ties = FALSE).
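As an illustration of the two ranking schemes, here's a sketch using base R's rank() (hypothetical counts; not packageRank's internal code):

counts <- c(a = 40, b = 38, c = 38, d = 10)  # hypothetical download counts
rank(-counts, ties.method = "min")    # competition ("1224"): ties share the lowest rank
> a b c d 
> 1 2 2 4 
rank(-counts, ties.method = "first")  # nominal/ordinal ("1234"): ties broken by order
> a b c d 
> 1 2 3 4 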

visualizing package download percentile ranks

To visualize packageRank(), use plot().

plot(packageRank(packages="cholera",date="2020-03-04"))


plot(packageRank(packages="cholera",date="2020-03-07"))

The graphs above, which are customized here to be on the same scale, plot the rank order of packages’ download counts (x-axis) against the logarithm of those counts (y-axis). Each plot highlights (in red) a package’s position in the distribution along with its percentile rank and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, ‘magrittr’ in both cases, is at the top left (in blue). The total number of downloads is at the top right (in blue).

III - inflation filters

‘cranlogs’ computes the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of “invalid” log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package’s actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of CRAN. In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this R-hub blog post.

software artifacts

When looking at package download logs, the first thing you’ll notice are wrongly sized log entries. They come in two sizes. The “small” entries are approximately 500 bytes in size. The “medium” entries vary in size, falling somewhere between a “small” entry and a full download (i.e., “small” <= “medium” <= full download). “Small” entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a “medium” and a full download. “Medium” entries manifest themselves as either standalone entries or as part of a triplet.

The example below illustrates a triplet:

packageLog(date = "2020-07-01")[4:6, -(4:6)]
>              date     time    size package version country ip_id
> 3998633 2020-07-01 07:56:15   99622 cholera   0.7.0      US  4760
> 3999066 2020-07-01 07:56:15 4161948 cholera   0.7.0      US  4760
> 3999178 2020-07-01 07:56:15     536 cholera   0.7.0      US  4760

The “medium” entry is the first observation (99,622 bytes). The full download is the second entry (4,161,948 bytes). The “small” entry is the last observation (536 bytes). At a minimum, what makes a triplet a triplet (or a pair a pair) is that all members share system configuration (e.g., IP address, etc.) and have identical or adjacent time stamps.

To deal with the inflationary effect of “small” entries, I filter out observations smaller than 1,000 bytes (the smallest package on CRAN appears to be ‘LifeInsuranceContracts’, whose source file weighs in at 1,100 bytes). “Medium” entries are harder to handle. I remove them using a filter function that looks up a package’s actual size.
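To make the idea behind the "small" entry filter concrete, here's a minimal sketch applied by hand (column names taken from the packageLog() output above; the package's own filters do more than this):

# Sketch only: drop log entries under 1,000 bytes from a package's log.
log <- packageLog(packages = "cholera", date = "2020-07-01")
log.filtered <- log[log$size >= 1000, ]  # removes ~500 byte "small" entries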

behavioral artifacts

While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of CRAN requires a change of perspective. While details and further evidence can be found in the R-hub blog post mentioned above, I’ll illustrate the problem with the following example:

packageLog(packages="cholera",date="2020-07-31")[8:14,-(4:6)]
>              date     time    size package version country ip_id
> 132509 2020-07-31 21:03:06 3797776 cholera   0.2.1      US    14
> 132106 2020-07-31 21:03:07 4285678 cholera   0.4.0      US    14
> 132347 2020-07-31 21:03:07 4109051 cholera   0.3.0      US    14
> 133198 2020-07-31 21:03:08 3766514 cholera   0.5.0      US    14
> 132630 2020-07-31 21:03:09 3764848 cholera   0.5.1      US    14
> 133078 2020-07-31 21:03:11 4275831 cholera   0.6.0      US    14
> 132644 2020-07-31 21:03:12 4284609 cholera   0.6.5      US    14

Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent all versions of ‘cholera’ available on that date:

packageHistory(package="cholera")
>   Package Version       Date Repository
> 1 cholera   0.2.1 2017-08-10    Archive
> 2 cholera   0.3.0 2018-01-26    Archive
> 3 cholera   0.4.0 2018-04-01    Archive
> 4 cholera   0.5.0 2018-07-16    Archive
> 5 cholera   0.5.1 2018-08-15    Archive
> 6 cholera   0.6.0 2019-03-08    Archive
> 7 cholera   0.6.5 2019-06-11    Archive
> 8 cholera   0.7.0 2019-08-28       CRAN

While there are “legitimate” reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I’d argue that examples like the above are “fingerprints” of efforts to download CRAN. While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in CRAN itself (a collection of packages) than of an interest in your package per se. And since one of the uses of counting package downloads is to assess interest in your package, it may be useful to exclude such entries.

To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download “too many” packages and then filters out campaigns, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with “greedy” IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.
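As a rough sketch of the first idea (assumed column names from the log output above; not the actual ip.filter code), you could count how many distinct packages each anonymized IP downloads in a day:

# Sketch only: tally distinct packages per anonymized IP for one day's log.
log <- packageLog(date = "2020-07-01")
pkgs.per.ip <- tapply(log$package, log$ip_id, function(x) length(unique(x)))
head(sort(pkgs.per.ip, decreasing = TRUE))  # "greedy" IPs float to the top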

example usage

To get an idea of how inflated your package’s download count may be, use filteredDownloads(). Below are the results for ‘ggplot2’ for 15 September 2021.

filteredDownloads(package = "ggplot2", date = "2021-09-15")
>         date package downloads filtered.downloads delta inflation
> 1 2021-09-15 ggplot2    113842             111662  2180    1.95 %

While there were 113,842 nominal downloads, applying all the filters reduced that number to 111,662, an inflation of 1.95%.

Excluding the time it takes to download the log file (typically the bulk of the computation time), the above example takes approximately 15 additional seconds to run on a single core of a 3.1 GHz Dual-Core Intel Core i5 processor.

There are 4 filters. You can control them using the following arguments (listed in order of application):

  • ip.filter: removes campaigns of “greedy” IP addresses.
  • small.filter: removes entries smaller than 1,000 bytes.
  • sequence.filter: removes blocs of past versions.
  • size.filter: removes entries smaller than a package’s binary or source file.

For filteredDownloads(), they are all on by default. For packageLog() and packageRank(), they are off by default. To apply them, simply set the argument for the filter you want to TRUE:

packageRank(package="cholera",small.filter=TRUE)

Alternatively, for packageLog() and packageRank() you can simply set all.filters = TRUE.

packageRank(package="cholera",all.filters=TRUE)

Note that all.filters = TRUE is contextual. Depending on the function used, you’ll either get the CRAN-specific or the package-specific set of filters. The former sets ip.filter = TRUE and small.filter = TRUE; it works independently of packages at the level of the entire log. The latter sets sequence.filter = TRUE and size.filter = TRUE; it relies on package-specific information (e.g., the size of the source or binary file).

Ideally, we’d like to use both sets. However, the package-specific set is computationally expensive because its filters need to be applied individually to all packages in the log, which can involve tens of thousands of packages. While not unfeasible, currently this takes a long time. For this reason, when all.filters = TRUE, packageRank(), ipPackage(), countryPackage(), countryDistribution() and packageDistribution() use only CRAN-specific filters, while packageLog(), packageCountry(), and filteredDownloads() use both CRAN- and package-specific filters.

IV - availability of results

To understand when results become available, you need to be aware that ‘packageRank’ has two upstream, online dependencies. The first is Posit/RStudio’s CRAN package download logs, which record traffic to the “0-Cloud” mirror at cloud.r-project.org (formerly Posit/RStudio’s CRAN mirror). The second is Gábor Csárdi’s ‘cranlogs’ R package, which uses those logs to compute the download counts of both the R application and R packages.

The CRAN package download logs for the previous day are typically posted by 17:00 UTC. The results for ‘cranlogs’ usually become available soon thereafter (sometimes as much as a day later).

why aren’t today’s logs and results available?

Occasionally problems with “today’s” data can emerge due to the upstream dependencies (illustrated below).

CRAN Download Logs --> 'cranlogs' --> 'packageRank'

If there’s a problem with the logs (e.g., they’re not posted on time), both ‘cranlogs’ and ‘packageRank’ will be affected. If this happens, you’ll see things like unexpected zero counts for your package(s) (actually, you’ll see a zero download count for both your package and for all of CRAN), data from “yesterday”, or a “Log is not (yet) on the server” error message.

'cranlogs' --> packageRank::cranDownloads()

If there’s a problem with ‘cranlogs’ but not with the logs, only packageRank::cranDownloads() will be affected. In that case, you might get a warning that only “previous” results will be used. All other ‘packageRank’ functions should work since they either directly access the logs or use some other source. Usually, these errors resolve themselves the next time the underlying scripts are run (“tomorrow”, if not sooner).

logInfo()

To check the status of the download logs and ‘cranlogs’, use logInfo(). This function checks whether 1) “today’s” log is posted on Posit/RStudio’s server and 2) “today’s” results have been computed by ‘cranlogs’.

logInfo()
$`Today's log/result`
[1] "2023-02-01"

$`Today's log posted?`
[1] "Yes"

$`Today's results on 'cranlogs'?`
[1] "No"

$status
[1] "Today's log is typically posted by 01 Feb 09:00 PST | 01 Feb 17:00 UTC."

time zones

Because you’re typically interested in today’s log file, another thing that affects availability is your time zone. For example, let’s say that it’s 09:01 on 01 January 2021 and you want to compute the percentile rank for ‘ergm’ for the last day of 2020. You might be tempted to use the following:

packageRank(packages="ergm")

However, depending on where you make this request, you may not get the data you expect. In Honolulu, USA, you will. In Sydney, Australia you won’t. The reason is that you’ve somehow forgotten a key piece of trivia: Posit/RStudio typically posts yesterday’s log around 17:00 UTC the following day.

The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC on 01 January 2021, so the log you want has been available for about 2 hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 22:01 UTC on 31 December 2020; the log you want won’t actually be available for another 19 hours.
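A quick way to see where you stand relative to the typical 17:00 UTC posting time is to convert your local time to UTC (a sketch; logInfo() does this kind of bookkeeping for you):

format(Sys.time(), tz = "UTC", usetz = TRUE)            # current time in UTC
as.integer(format(Sys.time(), "%H", tz = "UTC")) >= 17  # has 17:00 UTC passed today?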

To make life a little easier, ‘packageRank’ does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you’ll just get the last available log. Second, if you specified a date in the future, you’ll either get an error message or a warning with an estimate of when the log you want should be available.

Using the Sydney example and the expression above, you’d get the results for 30 December 2020:

packageRank(packages="ergm")
>         date package count          rank percentile
> 1 2020-12-30    ergm   292 878 of 20,077       95.6

If you had specified the date, you’d get an additional warning:

packageRank(packages="ergm",date="2021-01-01")
>         date package count          rank percentile
> 1 2020-12-30    ergm   292 878 of 20,077       95.6
Warning message:
2020-12-31 log arrives in ~19 hours at 02 Jan 04:00 AEDT. Using previous!

Keep in mind that 17:00 UTC is not a hard deadline. Barring server issues, the logs are usually posted a little before that time. I don’t know when the script starts, but the posting time seems to be a function of the number of entries: closer to 17:00 UTC when there are more entries (e.g., weekdays); earlier than 17:00 UTC when there are fewer entries (e.g., weekends). Again, barring server issues, the ‘cranlogs’ results are usually available before 18:00 UTC.

Here’s what you’d see using the Honolulu example:

logInfo(details=TRUE)
$`Today's log/result`
[1] "2020-12-31"

$`Today's log posted?`
[1] "Yes"

$`Today's results on 'cranlogs'?`
[1] "Yes"

$`Available log/result`
[1] "Posit/RStudio (2020-12-31); 'cranlogs' (2020-12-31)."

$status
[1] "Everything OK."

The function uses your local time zone, which depends on R’s ability to compute your local time and time zone (e.g., Sys.time() and Sys.timezone()). My understanding is that there may be operating system or platform specific issues that could undermine this.

V - Reverse lookup of counts, ranks and percentiles

To query the log for a specific count, rank or percentile rank, use the functions below:

queryCount()

To find the packages that had 100 downloads (the default is 1, the lowest number of observable downloads):

queryCount(100)
>         package count rank nominal.rank percentile
> 1     analogsea   100 2143         2129       92.1
> 2  ComplexUpset   100 2143         2130       92.1
> 3      detrendr   100 2143         2131       92.1
> 4          drat   100 2143         2132       92.1
> 5       enrichR   100 2143         2133       92.1
> 6      exact2x2   100 2143         2134       92.1
> 7       fdapace   100 2143         2135       92.1
> 8          fdth   100 2143         2136       92.1
> 9        ggmcmc   100 2143         2137       92.1
> 10      jsTreeR   100 2143         2138       92.1
> 11       likert   100 2143         2139       92.1
> 12      praznik   100 2143         2140       92.1
> 13     rayimage   100 2143         2141       92.1
> 14       rlemon   100 2143         2142       92.1
> 15        worcs   100 2143         2143       92.1

queryRank()

To find the package that was ranked 20th in downloads (the default is 1st, the most downloaded package):

queryRank(20)
>   package count rank nominal.rank percentile
> 1 stringr 33041   20           20       99.9

queryPercentile()

If you want the packages with a particular percentile rank, use queryPercentile(). Note that due to the discrete nature of counts, your choice of percentile may not be available because it may fall in the vertical gaps in the observed data.

For this reason, queryPercentile() rounds your selection to whole numbers. Also, the default value, which is set to 50, uses median() to guarantee a result.

# head() is used because there will be many observations with the median count.
head(queryPercentile())
>    package count  rank nominal.rank percentile
> 1 AATtools    12 13697        12845       49.2
> 2    abdiv    12 13697        12846       49.2
> 3 abglasso    12 13697        12847       49.2
> 4  ablasso    12 13697        12848       49.2
> 5   Ac3net    12 13697        12849       49.2
> 6      acp    12 13697        12850       49.2

You can also set a range of percentile ranks using the ‘lo’ and/or ‘hi’ arguments. If you get an error message, you may need to widen your interval:

head(queryPercentile(lo = 95, hi = 96), 3)
tail(queryPercentile(lo = 95, hi = 96), 3)
>      package count rank nominal.rank percentile
> 1    mapdata   420  931          931       96.5
> 2 shinyalert   418  932          932       96.5
> 3       klaR   416  935          933       96.5
>                package count rank nominal.rank percentile
> 536 PortfolioAnalytics   189 1466         1466       94.6
> 537              binom   188 1468         1467       94.6
> 538            prefmod   188 1468         1468       94.6

cranDistribution()

The above functions leverage cranDistribution(), which computes the ranks and the distribution of download counts for a given day’s log.

Its print method provides the date, the number of unique packages downloaded, the total number of downloads (the total number of rows/observations in the log) and the count and rank data for the top 20 packages:

cranDistribution()
> $date
> [1] "2024-08-01 Thursday"
>
> $unique.packages.downloaded
> [1] "26,959"
>
> $total.downloads
> [1] "5,760,937"
>
> $top.n
>       package count rank nominal.rank percentile
> 1       rlang 56311    1            1      100.0
> 2     ggplot2 53981    2            2      100.0
> 3       withr 51577    3            3      100.0
> 4         cli 51509    4            4      100.0
> 5   lifecycle 47771    5            5      100.0
> 6       dplyr 46734    6            6      100.0
> 7       vctrs 45173    7            7      100.0
> 8    jsonlite 43280    8            8      100.0
> 9        Rcpp 40935    9            9      100.0
> 10     tibble 39908   10           10      100.0
> 11       glue 39430   11           11      100.0
> 12     pillar 37875   12           12      100.0
> 13   magrittr 35133   13           13      100.0
> 14      bslib 35106   14           14       99.9
> 15 colorspace 34707   15           15       99.9
> 16       xfun 34442   16           16       99.9
> 17     scales 34052   17           17       99.9
> 18         R6 33516   18           18       99.9
> 19      fansi 33111   19           19       99.9
> 20    stringr 33041   20           20       99.9

Note that if you want to specify the number of top N packages, you’ll have to explicitly use print() and its ‘top.n’ argument:

print(cranDistribution(),top.n=7)

Alternatively, you can use queryRank():

queryRank(1:7)

The summary method provides the number of unique packages downloaded, the total number of downloads and the five-number summary (plus the arithmetic mean):

summary(cranDistribution())
> $unique.packages.downloaded
> [1] 26959
>
> $total.downloads
> [1] 5760937
>
> $download.summary
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     1.0     6.0    12.0   213.7    28.0 56311.0

The plot method graphs the distribution of the base 10 logarithm of download counts. Each plot is annotated with the median, mean and maximum download counts, as well as the total number of downloads and the total number of unique packages observed.

plot(cranDistribution())

VI - data fixes

‘packageRank’ fixes two data problems.

The first data problem involves logs collected between late 2012 and the beginning of 2013. It’s a bit complicated. To understand it, we need to know that the Posit/RStudio download logs are stored as separate files with a name/URL that embeds the log’s date:

http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz
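For reference, here is a minimal sketch of how such a URL can be built from a date, following the path pattern shown above (illustrative only):

log.date <- as.Date("2022-01-01")
paste0("http://cran-logs.rstudio.com/", format(log.date, "%Y"), "/", log.date, ".csv.gz")
> [1] "http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz"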

For the logs in question, this convention was broken in three ways: i) some logs are effectively duplicated (same log, multiple names), ii) at least one is mislabeled and iii) the logs from 13 October through 28 December are offset by +3 days (e.g., the file with the name/URL “2012-12-01” contains the log for “2012-11-28”). As a result, we get erroneous download counts and we actually lose the last three logs of 2012. Details are available here.

Unsurprisingly, all this affects download counts.

Functions that rely on cranlogs::cran_downloads() (e.g., packageRank::cranDownloads(), ‘adjustedcranlogs’ and ‘dlstats’) are susceptible to the first error - duplicate names. My understanding is that this is because ‘cranlogs’ uses the date in a log rather than the filename/URL to retrieve logs. To put it differently, ‘cranlogs’ can’t detect multiple instances of logs with the same date. I found 3 logs with duplicate filenames/URLs, and 5 additional instances of overcounting (including one of tripling). fixCranlogs() addresses this overcounting problem behind the scenes by recomputing the download counts using the actual log(s) when any of the eight problematic dates are requested. Details about the 8 days and fixCranlogs() can be found here.

Functions that access logs via their filename/URL, e.g., packageRank() and packageLog(), are affected by the second and third defects - mislabeled and offset logs. fixDate_2012() addresses this, in the background, by re-mapping problematic logs so you get the log you expect.

The second data problem is of more recent vintage. From 2023-09-13 through 2023-10-02, the download counts for the R application returned by cranlogs::cran_downloads(packages = "R") are, with two exceptions, twice what one would expect when looking at the actual log(s). The two exceptions are: 1) 2023-09-28, where the counts are identical but for a “rounding error” possibly due to NAs, and 2) 2023-09-30, where there is actually a three-fold difference.

Here are the relevant ratios of counts comparing ‘cranlogs’ results with counts based on the underlying logs:

    2023-09-12 2023-09-13 2023-09-14 2023-09-15 2023-09-16 2023-09-17 2023-09-18 2023-09-19
osx          1          2          2          2          2          2          2          2
src          1          2          2          2          2          2          2          2
win          1          2          2          2          2          2          2          2
    2023-09-20 2023-09-21 2023-09-22 2023-09-23 2023-09-24 2023-09-25 2023-09-26 2023-09-27
osx          2          2          2          2          2          2          2          2
src          2          2          2          2          2          2          2          2
win          2          2          2          2          2          2          2          2
    2023-09-28 2023-09-29 2023-09-30 2023-10-01 2023-10-02 2023-10-03
osx   1.000000          2          3          2          2          1
src   1.000801          2          3          2          2          1
win   1.000000          2          3          2          2          1

Details and code for replication can be found in issue #69. fixRCranlogs() corrects the problem.

Note that there was a similar issue for package download counts around the same period, but that is now fixed in ‘cranlogs’. For details, see issue #68.

VII - data note

R Windows Sunday and Wednesday download spikes (06 Nov 2022 - 19 March 2023)

The graph above for R downloads shows the daily downloads of the R application broken down by platform (Mac, Source, Windows). In it, you can see the typical pattern of mid-week peaks and weekend troughs.

Between 06 November 2022 and 19 March 2023, this pattern was broken. On Sundays (06 November 2022 - 19 March 2023) and Wednesdays (18 January 2023 - 15 March 2023), there were noticeable, repeated orders-of-magnitude spikes in the daily downloads of the Windows version of R.

plot(cranDownloads("R",from="2022-10-06",to="2023-04-14"))axis(3,at= as.Date("2022-11-06"),labels="2022-11-06",cex.axis=2/3,padj=0.9)axis(3,at= as.Date("2023-03-19"),labels="2023-03-19",cex.axis=2/3,padj=0.9)abline(v= as.Date("2022-11-06"),col="gray",lty="dotted")abline(v= as.Date("2023-03-19"),col="gray",lty="dotted")

These download spikes did not seem to affect either the Mac or Source versions. I show this in the graphs below. Each plot, which is individually scaled, breaks down the data in the graph above by day (Sunday or Wednesday) and platform.

The key thing is to compare the data in the period bounded by vertical dotted lines with the data before and after. If a Sunday or Wednesday is orders-of-magnitude unusual, I plot that day with a filled rather than an empty circle. Only Windows, in the final two graphs below, earns this distinction.

VIII - et cetera

For those interested in directly working with the download logs, this section describes some issues that may be of use.

country codes (top level domains)

While the IP addresses in the Posit/RStudio logs are anonymized, the logs include ISO country codes or top-level domains (e.g., AT, JP, US), which packageCountry() and countryPackage() make use of.

Note that coverage extends to only about 85% of observations (approximately 15% of country codes are NA), and that there seem to be a couple of typos for country codes: “A1” (A + the number one) and “A2” (A + the number 2). According to Posit/RStudio’s documentation, this coding was done using MaxMind’s free database, which no longer seems to be available and may be a bit out of date.
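If you want to check coverage for a particular day yourself, here's a simple sketch (using the 'country' column from the log output shown earlier; illustrative only):

# Sketch only: approximate share of log entries with a missing country code.
log <- packageLog(date = "2020-07-01")
round(100 * mean(is.na(log$country)), 1)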

memoization

To avoid the bottleneck of downloading multiple log files, packageRank() is currently limited to individual calendar dates. To reduce the bottleneck of re-downloading logs, which can approach 100 MB, ‘packageRank’ makes use of memoization via the ‘memoise’ package.

Here’s relevant code:

fetchLog<-function(url)data.table::fread(url)mfetchLog<-memoise::memoise(fetchLog)if (RCurl::url.exists(url)) {cran_log<- mfetchLog(url)}# Note that data.table::fread() relies on R.utils::decompressFile().

This means that logs are intelligently cached; those that have already been downloaded in your current R session will not be downloaded again.

timeout

With R 4.0.3, the timeout value for internet connections became more explicit. Here are the relevant details from that release’s “New features”:

The default value for options("timeout") can be set from environment variable R_DEFAULT_INTERNET_TIMEOUT, still defaulting to 60 (seconds) if that is not set or invalid.

This change can affect functions that download logs. This is especially true over slower internet connections or when you’re dealing with large log files. To fix this, fetchCranLog() will, if needed, temporarily set the timeout to 600 seconds.
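If you download logs yourself, here's a sketch of the same idea (not fetchCranLog()'s actual code): temporarily raise options("timeout") and restore the previous value afterwards.

# Sketch only: evaluate an expression with a longer connection timeout.
withLongTimeout <- function(expr, seconds = 600) {
  old <- getOption("timeout")
  on.exit(options(timeout = old), add = TRUE)  # restore on exit
  options(timeout = seconds)
  expr
}
# e.g., withLongTimeout(read.csv(url))  # 'url' here is a hypothetical log URL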

Footnotes

  1. Specifically, within each 5% interval of percentile ranks (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked.
