packageRank
‘packageRank’ is an R package that helps put package download counts into context. It does so via two core functions, cranDownloads() and packageRank(), a set of filters that reduce download count inflation, and a host of other assorted functions.
You can read more about the package in the sections below:
- I Download Counts describes how cranDownloads() gives cranlogs::cran_downloads() a more user-friendly interface and makes visualizing those data easy via its generic R plot() method.
- II Download Percentile Ranks describes how packageRank() makes use of percentile ranks. This nonparametric statistic computes the percentage of packages with fewer downloads than yours: a package in the 74th percentile has more downloads than 74% of packages. This facilitates comparison and helps you locate your package in the overall distribution of CRAN package downloads.
- III Inflation Filters describes four filter functions that remove software and behavioral artifacts that inflate nominal download counts. This functionality is available in packageRank() and packageLog().
- IV Availability of Results discusses when results become available, how to use logInfo() to check the availability of results, and the effect of time zones.
- V Reverse lookup of counts, ranks and percentiles discusses queryCount(), queryRank(), queryPercentile() and cranDistribution().
- VI Data Fixes discusses two problems with download counts. The first involves issues with logs collected between the end of 2012 and the beginning of 2013. This is fixed in fixDate_2012() and fixCranlogs(). The second is an issue with ‘cranlogs’ that doubles or triples the number of R application download counts between 2023-09-13 and 2023-10-02. This is fixed in fixRCranlogs().
- VII Data Note discusses the spike in the downloads of the Windows version of the R application on Sundays and Wednesdays between 06 November 2022 and 19 March 2023.
- VIII et cetera discusses country code top-level domains (e.g., countryPackage() and packageCountry()), the use of memoization and the internet connection timeout problem.
To install ‘packageRank’ from CRAN:
install.packages("packageRank")
To install the development version from GitHub:
# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)
cranDownloads() uses all the same arguments as cranlogs::cran_downloads():

cranlogs::cran_downloads(packages = "HistData")

>         date count  package
> 1 2020-05-01   338 HistData
The only difference is that cranDownloads() adds four features:

cranDownloads(packages = "GGplot2")
## Error in cranDownloads(packages = "GGplot2") :
##   GGplot2: misspelled or not on CRAN.

cranDownloads(packages = "ggplot2")

>         date count cumulative package
> 1 2020-05-01 56357      56357 ggplot2
Note that this also works for inactive or “retired” packages in the Archive:
cranDownloads(packages="vr")
## Error in cranDownloads(packages = "vr") :## vr: misspelled or not on CRAN/Archive.
cranDownloads(packages="VR")
> date count cumulative package> 1 2020-05-01 11 11 VR
With cranlogs::cran_downloads(), you specify a time frame using the from and to arguments. The downside of this is that you must use “yyyy-mm-dd”. For convenience’s sake, cranDownloads() also allows you to use “yyyy-mm” or yyyy (“yyyy” also works).
Let’s say you want the download counts for ‘HistData’ for February 2020. With cranlogs::cran_downloads(), you’d have to type out the whole date and remember that 2020 was a leap year:

cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01", to = "2020-02-29")
With cranDownloads(), you can just specify the year and month:

cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")
Let’s say you want the download counts for ‘rstan’ for 2020. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01", to = "2020-12-31")
With cranDownloads(), you can use:

cranDownloads(packages = "rstan", from = 2020, to = 2020)
or
cranDownloads(packages="rstan",from="2020",to="2020")
These additional date formats help to create convenient shortcuts. Let’s say you want the year-to-date download counts for ‘rstan’. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2023-01-01", to = Sys.Date() - 1)
With cranDownloads(), you can just pass the current year to from =:

cranDownloads(packages = "rstan", from = 2023)
And if you wanted the entire download history, pass the current year to to =:

cranDownloads(packages = "rstan", to = 2023)
Note that the Posit/RStudio logs begin on 01 October 2012.
cranDownloads(packages="HistData",from="2019-01-15",to="2019-01-35")
## Error in resolveDate(to, type = "to") : Not a valid date.
cranDownloads(packages="HistData",when="last-week")
> date count cumulative package> 1 2020-05-01 338 338 HistData> 2 2020-05-02 259 597 HistData> 3 2020-05-03 321 918 HistData> 4 2020-05-04 344 1262 HistData> 5 2020-05-05 324 1586 HistData> 6 2020-05-06 356 1942 HistData> 7 2020-05-07 324 2266 HistData
The “spell check” or validation of packages described above requires some additional background downloads. While those data are cached via the ‘memoise’ package, this will add time the first time cranDownloads() is run. For faster results, you can bypass those features by setting pro.mode = TRUE. The downside is that you’ll see zero downloads for packages on dates before they’re published on CRAN and zero downloads for mis-spelled/non-existent packages. You also won’t be able to use the to = argument by itself.
For example, ‘packageRank’ was first published on CRAN on 2019-05-16 (you can verify this via packageHistory("packageRank")). But if you use cranlogs::cran_downloads() or cranDownloads(pro.mode = TRUE), you’ll see zero downloads for dates before 2019-05-16:

cranDownloads("packageRank", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)

>         date count cumulative     package
> 1 2019-05-10     0          0 packageRank
> 2 2019-05-11     0          0 packageRank
> 3 2019-05-12     0          0 packageRank
> 4 2019-05-13     0          0 packageRank
> 5 2019-05-14     0          0 packageRank
> 6 2019-05-15     0          0 packageRank
> 7 2019-05-16    68         68 packageRank
You’ll notice this particularly when you include newer packages in cranDownloads().
If you mis-spell a package:

cranDownloads("vr", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)

>         date count cumulative package
> 1 2019-05-10     0          0      vr
> 2 2019-05-11     0          0      vr
> 3 2019-05-12     0          0      vr
> 4 2019-05-13     0          0      vr
> 5 2019-05-14     0          0      vr
> 6 2019-05-15     0          0      vr
> 7 2019-05-16     0          0      vr
If you just use to = without a value for from =, you’ll get an error:

cranDownloads(to = 2024, pro.mode = TRUE)
Error: You must also provide a date for "from".
cranDownloads() makes visualizing package downloads easy by using plot():

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))
If you pass a vector of package names for a single day, plot() returns a dotchart:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020-03-01", to = "2020-03-01"))
If you pass a vector of package names for multiple days, plot() uses ‘ggplot2’ facets:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"))
To plot those data in a single frame, set multi.plot = TRUE:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), multi.plot = TRUE)
To plot those data in separate plots on the same scale, set graphics = "base" and you’ll be prompted for each plot:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), graphics = "base")
To do the above on separate, independent scales, set same.xy = FALSE:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)
To use the base 10 logarithm of the download count in a plot, set log.y = TRUE:

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"), log.y = TRUE)
Note that for the sake of the plot, zero counts are replaced by ones so that the logarithm can be computed (this does not affect the data returned by cranDownloads()).
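For illustration only (this is not the package’s internal code), the substitution described above amounts to something like the following, where counts stands in for a hypothetical vector of daily download counts:

counts <- c(0, 3, 120, 0, 56)       # hypothetical daily download counts
round(log10(pmax(counts, 1)), 3)    # zeros bumped to one before taking log10
> [1] 0.000 0.477 2.079 0.000 1.748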
cranlogs::cran_downloads(packages = NULL) computes the total number of package downloads from CRAN. You can plot these data by using:

plot(cranDownloads(from = 2019, to = 2019))
cranlogs::cran_download(packages = "R")
computes the total number ofdownloads of the R application (note that you can only use “R” or avector of packages names, not both!). You can plot these data by using:
plot(cranDownloads(packages="R",from=2019,to=2019))
If you want the total count of R downloads, set r.total = TRUE:

plot(cranDownloads(packages = "R", from = 2019, to = 2019), r.total = TRUE)
Note that starting on Sunday, 06 November 2022 (and on Wednesdays starting 18 January 2023), there were spikes in downloads of the Windows version of R on Sundays and Wednesdays (details below in R Windows Sunday and Wednesday downloads).
To add a smoother to your plot, use smooth = TRUE:

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"), smooth = TRUE)
With graphs that use ‘ggplot2’, se = TRUE will add a confidence interval:

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"), from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)
In general, loess is the chosen smoother. Note that with base graphics, lowess is used when there are 7 or fewer observations. Thus, to control the degree of smoothness, you’ll typically use the span argument (the default is span = 0.75). With base graphics and 7 or fewer observations, you control the degree of smoothness using the f argument (the default is f = 2/3):

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"), from = "2020", to = "2020-03-20"), smooth = TRUE, span = 0.75)
plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"), from = "2020", to = "2020-03-20"), smooth = TRUE, graphics = "ggplot2", span = 0.33)
To annotate a graph with a package’s release dates (base graphics only):
plot(cranDownloads(packages="rstan",from="2019",to="2019"),package.version=TRUE)
To annotate a graph with R release dates:
plot(cranDownloads(packages="rstan",from="2019",to="2019"),r.version=TRUE)
To plot growth curves, set statistic = "cumulative":

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"), from = "2020", to = "2020-03-20"), statistic = "cumulative", multi.plot = TRUE, points = FALSE)
To visualize a package’s downloads relative to “all” other packages over time:

plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"), population.plot = TRUE)

This longitudinal view plots the date (x-axis) against the base 10 logarithm of the selected package’s download counts (y-axis). To get a sense of how the selected package’s performance stacks up against “all” other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (this is the “typical” pattern of downloads on CRAN for the selected time period).1
The default unit of observation for both cranDownloads() and cranlogs::cran_downloads() is the day. The graph below plots the daily downloads for ‘cranlogs’ from 01 January 2022 through 15 April 2022.

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"))
To view the data from a less granular perspective, change plot.cranDownloads()’s unit.observation argument from “day” to “week”, “month”, or “year”.

The graph below plots the data aggregated by month (with an added smoother):

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"), unit.observation = "month", smooth = TRUE, graphics = "ggplot2")
Three things to note. First, if the last/current month (far right) is still in progress (it’s not yet the end of the month), that observation will be split in two: one point for the in-progress total (empty black square), another for the estimated total (empty red circle). The estimate is based on the proportion of the month completed. In the example above, the 635 observed downloads from April 1 through April 15 translate into an estimate of 1,270 downloads for the entire month (30 / 15 * 635). Second, if a smoother is included, it will only use “complete” observations, not in-progress or estimated data. Third, all points are plotted along the x-axis on the first day of the month.
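As a quick check of the arithmetic behind that estimate (a worked illustration, not code from the package):

observed <- 635         # downloads observed from April 1 through April 15
days.elapsed <- 15
days.in.month <- 30
days.in.month / days.elapsed * observed
> [1] 1270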
The graph below plots the data aggregated by week (weeks begin on Sunday).

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-06-15"), unit.observation = "week", smooth = TRUE)

Four things to note. First, if the first week (far left) is incomplete (the ‘from’ date is not a Sunday), that observation will be split in two: one point for the observed total on the start date (gray empty square) and another point for the backdated total. Backdating involves completing the week by pushing the nominal start date back to include the previous Sunday (blue asterisk). In the example above, the nominal start date (01 January 2022) is moved back to include data through the previous Sunday (26 December 2021). This is useful because with a weekly unit of observation the first observation is likely to be truncated and would not give the most representative picture of the data. Second, if the last week (far right) is in progress (the ‘to’ date is not a Saturday), that observation will be split in two: the observed total (gray empty square) and the estimated total based on the proportion of the week completed (red empty circle). Third, just like the monthly plot, smoothers only use complete observations, including backdated data but excluding in-progress and estimated data. Fourth, with the exception of the first week’s observed count, which is plotted at its nominal date, points are plotted along the x-axis on Sundays, the first day of the week.
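If you’re curious how a date gets pushed back to the previous Sunday, base R’s "%w" format (0 = Sunday) is enough; this is just a sketch of the backdating idea, not the package’s implementation:

nominal.start <- as.Date("2022-01-01")
offset <- as.integer(format(nominal.start, "%w"))  # days since the previous Sunday
nominal.start - offset
> [1] "2021-12-26"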
For what it’s worth, below are my go-to commands for graphs. They take advantage of the RStudio IDE’s plot history panel, which allows you to cycle through and compare graphs. Typically, I’ll look at the data for the last year or so at the three available units of observation: day, week and month. I use base graphics, via graphics = "base", to take advantage of prompts and “nicer” axes annotation. This also allows me to easily add graphical elements afterwards as needed, e.g., abline(h = 100, lty = "dotted").

plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023), graphics = "base", package.version = TRUE, smooth = TRUE, unit.observation = "day")
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023), graphics = "base", package.version = TRUE, smooth = TRUE, unit.observation = "week")
# Note that I disable smoothing for monthly data
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023), graphics = "base", package.version = TRUE, smooth = FALSE, unit.observation = "month")
Perhaps the biggest downside of using cranDownloads(pro.mode = TRUE) is that you might draw mistaken inferences from plotting the data since it adds false zeroes to your data.
Using the example of ‘packageRank’, which was published on 2019-05-16:
plot(cranDownloads("packageRank",from="2019-05",to="2019-05",pro.mode=TRUE),smooth=TRUE)
plot(cranDownloads("packageRank",from="2019-05",to="2019-05",pro.mode=FALSE),smooth=TRUE)
After spending some time with nominal download counts, the “compared to what?” question will come to mind. For instance, consider the data for the ‘cholera’ package from the first week of March 2020:

plot(cranDownloads(packages = "cholera", from = "2020-03-01", to = "2020-03-07"))

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?
To answer these questions, we can start by looking at the total number of package downloads:

plot(cranDownloads(from = "2020-03-01", to = "2020-03-07"))

Here we see that there’s a big difference between the work week and the weekend. This seems to indicate that the download activity for ‘cholera’ on the weekend is high. Moreover, the Wednesday peak for ‘cholera’ downloads seems higher than the mid-week peak of total downloads.
One way to better address these observations is to locate your package’s download counts in the overall frequency distribution of download counts. ‘packageRank’ allows you to do so via packageDistribution(). Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count’s frequency. The location of ‘cholera’ in the distribution is highlighted in red.

plot(packageDistribution(package = "cholera", date = "2020-03-04"))

plot(packageDistribution(package = "cholera", date = "2020-03-07"))
While these plots give us a better picture of where ‘cholera’ is located, comparisons between Wednesday and Saturday are still impressionistic: all we can confidently say is that the download counts for both days were greater than the mode.

To facilitate interpretation and comparison, I use the percentile rank of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, percentile ranks make it easier to compare packages within and across distributions.

For example, we can compare Wednesday (“2020-03-04”) to Saturday (“2020-03-07”):
packageRank(package="cholera",date="2020-03-04")>datepackagecountrankpercentile>12020-03-04cholera385,788of18,03867.9
On Wednesday, we can see that‘cholera’ had 38downloads, came in 5,788th place out of the 18,038 different packagesdownloaded, and earned a spot in the 68th percentile.
packageRank(package="cholera",date="2020-03-07")>datepackagecountrankpercentile>12020-03-07cholera293,189of15,95080
On Saturday, we can see that‘cholera’ had 29downloads, came in 3,189st place out of the 15,950 different packagesdownloaded, and earned a spot in the 80th percentile.
So contrary to what the nominal counts tell us, one could say that theinterest in‘cholera’ wasactually greater on Saturday than on Wednesday.
To compute percentile ranks, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using ‘cholera’ from Wednesday as an example:

pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")
downloads <- pkg.rank$cran.data$count
names(downloads) <- pkg.rank$cran.data$package
round(100 * mean(downloads < downloads["cholera"]), 1)
> [1] 67.9
To put it differently:
(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
> [1] 12250
(tot.pkgs <- length(downloads))
> [1] 18038
round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
> [1] 67.9
In the example above, 38 downloads puts ‘cholera’ in 5,788th place if we allow for ties using competition ranking (i.e., “1224” ranking) and 5,556th place if we don’t, using nominal/ordinal ranking (i.e., “1234” ranking).

Prior to v0.9.2.9008, only nominal/ordinal ranking was available. Competition ranking is now the default via packageRank(rank.ties = TRUE). If you want ordinal ranking, use packageRank(rank.ties = FALSE).
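Base R’s rank() can illustrate the difference between “1224” and “1234” ranking in general (a toy example, unrelated to packageRank’s internal computation):

counts <- c(a = 50, b = 38, c = 38, d = 10)   # toy download counts
rank(-counts, ties.method = "min")            # competition ("1224") ranking
> a b c d
> 1 2 2 4
rank(-counts, ties.method = "first")          # ordinal ("1234") ranking
> a b c d
> 1 2 3 4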
To visualize packageRank(), use plot():

plot(packageRank(packages = "cholera", date = "2020-03-04"))

plot(packageRank(packages = "cholera", date = "2020-03-07"))
The graphs above, which are customized here to be on the same scale, plot the rank order of packages’ download counts (x-axis) against the logarithm of those counts (y-axis). They then highlight (in red) a package’s position in the distribution along with its percentile rank and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, ‘magrittr’ in both cases, is at the top left (in blue). The total number of downloads is at the top right (in blue).
‘cranlogs’ computes the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of “invalid” log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package’s actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of CRAN. In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this R-hub blog post.
When looking at package download logs, the first thing you’ll notice is wrongly sized log entries. They come in two sizes. The “small” entries are approximately 500 bytes in size. The “medium” entries vary in size, falling somewhere between a “small” entry and a full download (i.e., “small” <= “medium” <= full download). “Small” entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a “medium” and a full download. “Medium” entries manifest themselves as either standalone entries or as part of a triplet.
The example below illustrates a triplet:
packageLog(date="2020-07-01")[4:6,-(4:6)]>datetimesizepackageversioncountryip_id>39986332020-07-0107:56:1599622cholera0.7.0US4760>39990662020-07-0107:56:154161948cholera0.7.0US4760>39991782020-07-0107:56:15536cholera0.7.0US4760
The “medium” entry is the first observation (99,622 bytes). The fulldownload is the second entry (4,161,948 bytes). The “small” entry is thelast observation (536 bytes). At a minimum, what makes a triplet atriplet (or a pair a pair) is that all members share systemconfiguration (e.g. IP address, etc.) and have identical or adjacenttime stamps.
To deal with the inflationary effect of “small” entries, I filter out observations smaller than 1,000 bytes (the smallest package on CRAN appears to be ‘LifeInsuranceContracts’, whose source file weighs in at 1,100 bytes). “Medium” entries are harder to handle. I remove them using a filter function that looks up a package’s actual size.
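To make the “small” entry cutoff concrete (an illustration of the idea only, not the implementation of small.filter), imagine the triplet above stored in a data frame with a size column in bytes:

cran_log <- data.frame(package = "cholera", size = c(99622, 4161948, 536))
cran_log[cran_log$size >= 1000, ]   # drop entries smaller than 1,000 bytes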
While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of CRAN requires a change of perspective. While details and further evidence can be found in the R-hub blog post mentioned above, I’ll illustrate the problem with the following example:
packageLog(packages="cholera",date="2020-07-31")[8:14,-(4:6)]
> date time size package version country ip_id> 132509 2020-07-31 21:03:06 3797776 cholera 0.2.1 US 14> 132106 2020-07-31 21:03:07 4285678 cholera 0.4.0 US 14> 132347 2020-07-31 21:03:07 4109051 cholera 0.3.0 US 14> 133198 2020-07-31 21:03:08 3766514 cholera 0.5.0 US 14> 132630 2020-07-31 21:03:09 3764848 cholera 0.5.1 US 14> 133078 2020-07-31 21:03:11 4275831 cholera 0.6.0 US 14> 132644 2020-07-31 21:03:12 4284609 cholera 0.6.5 US 14
Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent all versions of ‘cholera’ available on that date:

packageHistory(package = "cholera")

>   Package Version       Date Repository
> 1 cholera   0.2.1 2017-08-10    Archive
> 2 cholera   0.3.0 2018-01-26    Archive
> 3 cholera   0.4.0 2018-04-01    Archive
> 4 cholera   0.5.0 2018-07-16    Archive
> 5 cholera   0.5.1 2018-08-15    Archive
> 6 cholera   0.6.0 2019-03-08    Archive
> 7 cholera   0.6.5 2019-06-11    Archive
> 8 cholera   0.7.0 2019-08-28       CRAN
While there are “legitimate” reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I’d argue that examples like the above are “fingerprints” of efforts to download CRAN. While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in CRAN itself (a collection of packages) than of an interest in your package per se. And since one of the uses of counting package downloads is to assess interest in your package, it may be useful to exclude such entries.

To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download “too many” packages and then filters out campaigns, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with “greedy” IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.
To get an idea of how inflated your package’s download count may be, use filteredDownloads(). Below are the results for ‘ggplot2’ for 15 September 2021.

filteredDownloads(package = "ggplot2", date = "2021-09-15")

>         date package downloads filtered.downloads delta inflation
> 1 2021-09-15 ggplot2    113842             111662  2180    1.95 %

While there were 113,842 nominal downloads, applying all the filters reduced that number to 111,662, an inflation of 1.95%.
Excluding the time it takes to download the log file (typically the bulk of the computation time), the above example takes approximately 15 additional seconds to run on a single core of a 3.1 GHz Dual-Core Intel Core i5 processor.

There are 4 filters. You can control them using the following arguments (listed in order of application):
- ip.filter: removes campaigns of “greedy” IP addresses.
- small.filter: removes entries smaller than 1,000 bytes.
- sequence.filter: removes blocs of past versions.
- size.filter: removes entries smaller than a package’s binary or source file.
For filteredDownloads(), they are all on by default. For packageLog() and packageRank(), they are off by default. To apply them, simply set the argument for the filter you want to TRUE:

packageRank(package = "cholera", small.filter = TRUE)
Alternatively, for packageLog() and packageRank() you can simply set all.filters = TRUE:

packageRank(package = "cholera", all.filters = TRUE)
Note that all.filters = TRUE is contextual. Depending on the function used, you’ll either get the CRAN-specific or the package-specific set of filters. The former sets ip.filter = TRUE and small.filter = TRUE; it works independently of packages at the level of the entire log. The latter sets sequence.filter = TRUE and size.filter = TRUE; it relies on package-specific information (e.g., the size of the source or binary file).
Ideally, we’d like to use both sets. However, the package-specific set is computationally expensive because those filters need to be applied individually to all packages in the log, which can involve tens of thousands of packages. While not unfeasible, currently this takes a long time. For this reason, when all.filters = TRUE, packageRank(), ipPackage(), countryPackage(), countryDistribution() and packageDistribution() use only the CRAN-specific filters, while packageLog(), packageCountry(), and filteredDownloads() use both the CRAN-specific and package-specific filters.
To understand when results become available, you need to be aware that ‘packageRank’ has two upstream, online dependencies. The first is Posit/RStudio’s CRAN package download logs, which record traffic to the “0-Cloud” mirror at cloud.r-project.org (formerly Posit/RStudio’s CRAN mirror). The second is Gábor Csárdi’s ‘cranlogs’ R package, which uses those logs to compute the download counts of both the R application and R packages.

The CRAN package download logs for the previous day are typically posted by 17:00 UTC. The results for ‘cranlogs’ usually become available soon thereafter (sometimes as much as a day later).
Occasionally problems with “today’s” data can emerge due to the upstream dependencies (illustrated below).
CRAN Download Logs --> 'cranlogs' --> 'packageRank'
If there’s a problem with the logs (e.g., they’re not posted on time), both ‘cranlogs’ and ‘packageRank’ will be affected. If this happens, you’ll see things like an unexpected zero count(s) for your package(s) (actually, you’ll see a zero download count for both your package and for all of CRAN), data from “yesterday”, or a “Log is not (yet) on the server” error message.
'cranlogs' --> packageRank::cranDownloads()
If there’s a problem with ‘cranlogs’ but not with the logs, only packageRank::cranDownloads() will be affected. In that case, you might get a warning that only “previous” results will be used. All other ‘packageRank’ functions should work since they either directly access the logs or use some other source. Usually, these errors resolve themselves the next time the underlying scripts are run (“tomorrow”, if not sooner).
To check the status of the download logs and ‘cranlogs’, use logInfo(). This function checks whether 1) “today’s” log is posted on Posit/RStudio’s server and 2) “today’s” results have been computed by ‘cranlogs’.

logInfo()

$`Today's log/result`
[1] "2023-02-01"

$`Today's log posted?`
[1] "Yes"

$`Today's results on 'cranlogs'?`
[1] "No"

$status
[1] "Today's log is typically posted by 01 Feb 09:00 PST | 01 Feb 17:00 UTC."
Because you’re typically interested in today’s log file, another thing that affects availability is your time zone. For example, let’s say that it’s 09:01 on 01 January 2021 and you want to compute the percentile rank for ‘ergm’ for the last day of 2020. You might be tempted to use the following:

packageRank(packages = "ergm")

However, depending on where you make this request, you may not get the data you expect. In Honolulu, USA, you will. In Sydney, Australia you won’t. The reason is that you’ve somehow forgotten a key piece of trivia: Posit/RStudio typically posts yesterday’s log around 17:00 UTC the following day.

The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC on 01 January 2021, so the log you want has been available for about 2 hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 22:01 UTC on 31 December 2020; the log you want won’t actually be available for another 19 hours.
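If you want to check this yourself, base R can convert a local time to UTC (a simple illustration; nothing in ‘packageRank’ requires this step):

honolulu <- as.POSIXct("2021-01-01 09:01", tz = "Pacific/Honolulu")
sydney <- as.POSIXct("2021-01-01 09:01", tz = "Australia/Sydney")
format(honolulu, "%Y-%m-%d %H:%M", tz = "UTC", usetz = TRUE)
> [1] "2021-01-01 19:01 UTC"
format(sydney, "%Y-%m-%d %H:%M", tz = "UTC", usetz = TRUE)
> [1] "2020-12-31 22:01 UTC"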
To make life a little easier, ‘packageRank’ does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you’ll just get the last available log. Second, if you specified a date in the future, you’ll either get an error message or a warning with an estimate of when the log you want should be available.
Using the Sydney example and the expression above, you’d get the results for 30 December 2020:

packageRank(packages = "ergm")

>         date package count           rank percentile
> 1 2020-12-30    ergm   292 878 of 20,077        95.6
If you had specified the date, you’d get an additional warning:

packageRank(packages = "ergm", date = "2021-01-01")

>         date package count           rank percentile
> 1 2020-12-30    ergm   292 878 of 20,077        95.6
Warning message:
2020-12-31 log arrives in ~19 hours at 02 Jan 04:00 AEDT. Using previous!
Keep in mind that 17:00 UTC is not a hard deadline. Barring server issues, the logs are usually posted a little before that time. I don’t know when the script starts, but the posting time seems to be a function of the number of entries: closer to 17:00 UTC when there are more entries (e.g., weekdays); earlier than 17:00 UTC when there are fewer entries (e.g., weekends). Again, barring server issues, the ‘cranlogs’ results are usually available before 18:00 UTC.
Here’s what you’d see using the Honolulu example:

logInfo(details = TRUE)

$`Today's log/result`
[1] "2020-12-31"

$`Today's log posted?`
[1] "Yes"

$`Today's results on 'cranlogs'?`
[1] "Yes"

$`Available log/result`
[1] "Posit/RStudio (2020-12-31); 'cranlogs' (2020-12-31)."

$status
[1] "Everything OK."
The function uses your local time zone, which depends on R’s ability to compute your local time and time zone (e.g., Sys.time() and Sys.timezone()). My understanding is that there may be operating system or platform specific issues that could undermine this.

To query the log for a specific count, rank or percentile rank, use the functions below.

To find the packages that had 100 downloads (the default is 1, the lowest number of observable downloads):
queryCount(100)

>         package count rank nominal.rank percentile
> 1     analogsea   100 2143         2129       92.1
> 2  ComplexUpset   100 2143         2130       92.1
> 3      detrendr   100 2143         2131       92.1
> 4          drat   100 2143         2132       92.1
> 5       enrichR   100 2143         2133       92.1
> 6      exact2x2   100 2143         2134       92.1
> 7       fdapace   100 2143         2135       92.1
> 8          fdth   100 2143         2136       92.1
> 9        ggmcmc   100 2143         2137       92.1
> 10      jsTreeR   100 2143         2138       92.1
> 11       likert   100 2143         2139       92.1
> 12      praznik   100 2143         2140       92.1
> 13     rayimage   100 2143         2141       92.1
> 14       rlemon   100 2143         2142       92.1
> 15        worcs   100 2143         2143       92.1
To find the package that was ranked 20th in downloads (the default is 1st, the most downloaded package):

queryRank(20)

>   package count rank nominal.rank percentile
> 1 stringr 33041   20           20       99.9
If you want the packages with a particular percentile rank, use queryPercentile(). Note that due to the discrete nature of counts, your choice of percentile may not be available because it may fall in the vertical gaps in the observed data.

For this reason, queryPercentile() rounds your selection to whole numbers. Also, the default value, which is set to 50, uses median() to guarantee a result.
# head() is used because there will be many observations with median count.
head(queryPercentile())

>    package count  rank nominal.rank percentile
> 1 AATtools    12 13697        12845       49.2
> 2    abdiv    12 13697        12846       49.2
> 3 abglasso    12 13697        12847       49.2
> 4  ablasso    12 13697        12848       49.2
> 5   Ac3net    12 13697        12849       49.2
> 6      acp    12 13697        12850       49.2
You can also set a range of percentile ranks using the ‘lo’ and/or ‘hi’ arguments. If you get an error message, you may need to widen your interval:

head(queryPercentile(lo = 95, hi = 96), 3)
tail(queryPercentile(lo = 95, hi = 96), 3)

>      package count rank nominal.rank percentile
> 1    mapdata   420  931          931       96.5
> 2 shinyalert   418  932          932       96.5
> 3       klaR   416  935          933       96.5

>                package count rank nominal.rank percentile
> 536 PortfolioAnalytics   189 1466         1466       94.6
> 537              binom   188 1468         1467       94.6
> 538            prefmod   188 1468         1468       94.6
The above functions leverage cranDistribution(), which computes the ranks and the distribution of download counts for a given day’s log.

Its print method provides the date, the number of unique packages downloaded, the total number of downloads (the total number of rows/observations in the log) and the count and rank data for the top 20 packages:
cranDistribution()

> $date
> [1] "2024-08-01 Thursday"
>
> $unique.packages.downloaded
> [1] "26,959"
>
> $total.downloads
> [1] "5,760,937"
>
> $top.n
>       package count rank nominal.rank percentile
> 1       rlang 56311    1            1      100.0
> 2     ggplot2 53981    2            2      100.0
> 3       withr 51577    3            3      100.0
> 4         cli 51509    4            4      100.0
> 5   lifecycle 47771    5            5      100.0
> 6       dplyr 46734    6            6      100.0
> 7       vctrs 45173    7            7      100.0
> 8    jsonlite 43280    8            8      100.0
> 9        Rcpp 40935    9            9      100.0
> 10     tibble 39908   10           10      100.0
> 11       glue 39430   11           11      100.0
> 12     pillar 37875   12           12      100.0
> 13   magrittr 35133   13           13      100.0
> 14      bslib 35106   14           14       99.9
> 15 colorspace 34707   15           15       99.9
> 16       xfun 34442   16           16       99.9
> 17     scales 34052   17           17       99.9
> 18         R6 33516   18           18       99.9
> 19      fansi 33111   19           19       99.9
> 20    stringr 33041   20           20       99.9
Note that if you want to specify the number of top N packages, you’ll have to explicitly use the print() method and its ‘top.n’ argument:

print(cranDistribution(), top.n = 7)

Alternatively, you can use queryRank():

queryRank(1:7)
The summary method provides the number of unique packages downloaded, the total number of downloads and the five number summary (plus the arithmetic mean):

summary(cranDistribution())

> $unique.packages.downloaded
> [1] 26959
>
> $total.downloads
> [1] 5760937
>
> $download.summary
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     1.0     6.0    12.0   213.7    28.0 56311.0
The plot method graphs the distribution of the base 10 logarithm of download counts. Each plot is annotated with the median, mean and maximum download counts, as well as the total number of downloads and the total number of unique packages observed.
plot(cranDistribution())
‘packageRank’ fixes two data problems.

The first data problem involves logs collected between late 2012 and the beginning of 2013. It’s a bit complicated. To understand it, we need to know that the Posit/RStudio download logs are stored as separate files with a name/URL that embeds the log’s date:

http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz
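To make the naming convention concrete, here’s how such a URL could be assembled for an arbitrary date (an illustration, not a function from the package):

log.date <- as.Date("2022-01-01")
paste0("http://cran-logs.rstudio.com/", format(log.date, "%Y"), "/", log.date, ".csv.gz")
> [1] "http://cran-logs.rstudio.com/2022/2022-01-01.csv.gz"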
For the logs in question, this convention was broken in three ways: i) some logs are effectively duplicated (same log, multiple names), ii) at least one is mislabeled and iii) the logs from 13 October through 28 December are offset by +3 days (e.g., the file with the name/URL “2012-12-01” contains the log for “2012-11-28”). As a result, we get erroneous download counts and we actually lose the last three logs of 2012. Details are available here.
Unsurprisingly, all this affects download counts.
Functions that rely on cranlogs::cran_downloads() (e.g., ‘packageRank::cranDownloads()’, ‘adjustedcranlogs’ and ‘dlstats’) are susceptible to the first error, duplicate names. My understanding is that this is because ‘cranlogs’ uses the date in a log rather than the filename/URL to retrieve logs. To put it differently, ‘cranlogs’ can’t detect multiple instances of logs with the same date. I found 3 logs with duplicate filename/URLs, and 5 additional instances of overcounting (including one of tripling). fixCranlogs() addresses this overcounting problem behind the scenes by recomputing the download counts using the actual log(s) when any of the eight problematic dates are requested. Details about the 8 days and fixCranlogs() can be found here.
Functions that access logs via their filename/URL, e.g., packageRank() and packageLog(), are affected by the second and third defects, mislabeled and offset logs. fixDate_2012() addresses this, in the background, by re-mapping problematic logs so you get the log you expect.
The second data problem is of more recent vintage. From 2023-09-13 through 2023-10-02, the download counts for the R application returned by cranlogs::cran_downloads(packages = "R") are, with two exceptions, twice what one would expect when looking at the actual log(s). The two exceptions are: 1) 2023-09-28, where the counts are identical but for a “rounding error” possibly due to NAs, and 2) 2023-09-30, where there is actually a three-fold difference.

Here are the relevant ratios of counts comparing ‘cranlogs’ results with counts based on the underlying logs:
    2023-09-12 2023-09-13 2023-09-14 2023-09-15 2023-09-16 2023-09-17 2023-09-18 2023-09-19
osx          1          2          2          2          2          2          2          2
src          1          2          2          2          2          2          2          2
win          1          2          2          2          2          2          2          2

    2023-09-20 2023-09-21 2023-09-22 2023-09-23 2023-09-24 2023-09-25 2023-09-26 2023-09-27
osx          2          2          2          2          2          2          2          2
src          2          2          2          2          2          2          2          2
win          2          2          2          2          2          2          2          2

    2023-09-28 2023-09-29 2023-09-30 2023-10-01 2023-10-02 2023-10-03
osx   1.000000          2          3          2          2          1
src   1.000801          2          3          2          2          1
win   1.000000          2          3          2          2          1
Details and code for replication can be found in issue #69. fixRCranlogs() corrects the problem.

Note that there was a similar issue for package download counts around the same period, but that is now fixed in ‘cranlogs’. For details, see issue #68.
The graph above for R downloads shows the daily downloads of the R application broken down by platform (Mac, Source, Windows). In it, you can see the typical pattern of mid-week peaks and weekend troughs.

Between 06 November 2022 and 19 March 2023, this pattern was broken. On Sundays (06 November 2022 - 19 March 2023) and Wednesdays (18 January 2023 - 15 March 2023), there were noticeable, repeated orders-of-magnitude spikes in the daily downloads of the Windows version of R.
plot(cranDownloads("R",from="2022-10-06",to="2023-04-14"))axis(3,at= as.Date("2022-11-06"),labels="2022-11-06",cex.axis=2/3,padj=0.9)axis(3,at= as.Date("2023-03-19"),labels="2023-03-19",cex.axis=2/3,padj=0.9)abline(v= as.Date("2022-11-06"),col="gray",lty="dotted")abline(v= as.Date("2023-03-19"),col="gray",lty="dotted")
These download spikes did not seem to affect either the Mac or Source versions. I show this in the graphs below. Each plot, which is individually scaled, breaks down the data in the graph above by day (Sunday or Wednesday) and platform.

The key thing is to compare the data in the period bounded by vertical dotted lines with the data before and after. If a Sunday or Wednesday is orders-of-magnitude unusual, I plot that day with a filled rather than an empty circle. Only Windows, the final two graphs below, earns this distinction.
For those interested in directly working with the download logs, this section describes some issues that may be of use.
While the IP addresses in the Posit/RStudio logs are anonymized, the logs do include ISO country codes or top level domains (e.g., AT, JP, US), which packageCountry() and countryPackage() make use of.
Note that coverage extends to only about 85% of observations (approximately 15% of country codes are NA), and that there seem to be a couple of typos for country codes: “A1” (A + number one) and “A2” (A + number 2). According to Posit/RStudio’s documentation, this coding was done using MaxMind’s free database, which no longer seems to be available and may be a bit out of date.
To avoid the bottleneck of downloading multiple log files, packageRank() is currently limited to individual calendar dates. To reduce the bottleneck of re-downloading logs, which can approach 100 MB, ‘packageRank’ makes use of memoization via the ‘memoise’ package.
Here’s relevant code:
fetchLog <- function(url) data.table::fread(url)

mfetchLog <- memoise::memoise(fetchLog)

if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}

# Note that data.table::fread() relies on R.utils::decompressFile().
This means that logs are intelligently cached; those that have already been downloaded in your current R session will not be downloaded again.
With R 4.0.3, the timeout value for internet connections became more explicit. Here are the relevant details from that release’s “New features”:
The default value for options("timeout") can be set from environment variable R_DEFAULT_INTERNET_TIMEOUT, still defaulting to 60 (seconds) if that is not set or invalid.
This change can affect functions that download logs. This is especially true over slower internet connections or when you’re dealing with large log files. To fix this, fetchCranLog() will, if needed, temporarily set the timeout to 600 seconds.
Footnotes

1. Specifically, within each 5% interval of percentile ranks (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked.