Python Packaging User Guide

Analyzing PyPI package downloads

This section covers how to use the public PyPI download statistics dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.

Background

PyPI does not display download statistics for a number of reasons: [1]

  • Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.

  • Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:

    • pip’s download cache (lowers download counts)

    • Internal or unofficial mirrors (can either raise or lower download counts)

    • Packages not hosted on PyPI (for comparison’s sake)

    • Unofficial scripts or attempts at download count inflation (raises download counts)

    • Known historical data quality issues (lowers download counts)

  • Not particularly useful: Just because a project has been downloaded a lot doesn’t mean it’s good; similarly, just because a project hasn’t been downloaded a lot doesn’t mean it’s bad!

In short, because its value is low for various reasons, and the tradeoffs required to make it work are high, it has not been an effective use of limited resources.

Public dataset

As an alternative, the Linehaul project streams download logs from PyPI to Google BigQuery [2], where they are stored as a public dataset.

Getting set up

To use Google BigQuery to query the public PyPI download statistics dataset, you’ll need a Google account, and you’ll need to enable the BigQuery API on a Google Cloud Platform project. You can run up to 1 TB of queries per month using the BigQuery free tier without a credit card.

For more detailed instructions on how to get started with BigQuery, check out the BigQuery quickstart guide.
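
As a rough illustration of how far the free tier goes, the sketch below estimates how many runs of a query fit into the monthly allowance. It assumes that 1 TB means 10**12 bytes and that every run processes the same amount of data. The function name and the sample sizes are hypothetical, and BigQuery's actual billing rules may round differently.

```python
# Assumption: the free tier's "1 TB" is taken as 10**12 bytes.
FREE_TIER_BYTES_PER_MONTH = 10**12


def queries_within_free_tier(bytes_per_query: int) -> int:
    """Return how many identical queries fit in one month's free tier."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return FREE_TIER_BYTES_PER_MONTH // bytes_per_query


if __name__ == "__main__":
    # Hypothetical query sizes, in decimal gigabytes.
    for size_gb in (5, 50, 500):
        runs = queries_within_free_tier(size_gb * 10**9)
        print(f"{size_gb} GB per query: about {runs} runs per month")
```

A query that scans several hundred gigabytes uses up the allowance after only a couple of runs, which is why the example queries in this section restrict the date range.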

Data schema

Linehaul writes an entry in a bigquery-public-data.pypi.file_downloads table for each download. The table contains information about what file was downloaded and how it was downloaded. Some useful columns from the table schema include:

Column                   Description       Examples
timestamp                Date and time     2020-03-09 00:33:03 UTC
file.project             Project name      pipenv, nose
file.version             Package version   0.1.6, 1.4.2
details.installer.name   Installer         pip, bandersnatch
details.python           Python version    2.7.12, 3.6.4
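
The dotted column names refer to nested record fields. As a minimal sketch of that shape, the Python below models one row as nested dictionaries and resolves a dotted name against it. The sample row is made up from the example values listed for each column, and get_column is a hypothetical helper, not part of any BigQuery library.

```python
from typing import Any

# Hypothetical row, assembled from the example values listed for each column.
sample_row = {
    "timestamp": "2020-03-09 00:33:03 UTC",
    "file": {"project": "pipenv", "version": "0.1.6"},
    "details": {"installer": {"name": "pip"}, "python": "3.6.4"},
}


def get_column(row: dict, dotted_name: str) -> Any:
    """Walk nested dictionaries following a dotted column name."""
    value: Any = row
    for part in dotted_name.split("."):
        if not isinstance(value, dict) or part not in value:
            # Assumption of this sketch: a missing field is reported as None.
            return None
        value = value[part]
    return value


if __name__ == "__main__":
    for name in ("file.project", "details.installer.name", "details.python"):
        print(name, "=", get_column(sample_row, name))
```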

Useful queries

Run queries in the BigQuery web UI by clicking the “Compose query” button.

Note that the rows are stored in a partitioned table, which helps limit the cost of queries. These example queries analyze downloads from recent history by filtering on the timestamp column.

Counting package downloads

The following query counts the total number of downloads for the project “pytest”.

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'pytest'
      -- Only query the last 30 days of history
      AND DATE(timestamp)
        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        AND CURRENT_DATE()

num_downloads

26190085
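
To reuse the counting query for other projects, one option is to generate the SQL from Python. The sketch below is one way to do that: it keeps the project name as a named query parameter (@project) rather than pasting it into the SQL text, and validates the day count before interpolating it. build_count_query is a hypothetical helper, and handing the returned parameter values to BigQuery (for example through the client library's query parameters) is left out here.

```python
from typing import Dict, Tuple

TABLE = "bigquery-public-data.pypi.file_downloads"


def build_count_query(project: str, days: int = 30) -> Tuple[str, Dict[str, str]]:
    """Return the counting SQL and the parameter values it expects."""
    # Only a validated integer is ever interpolated into the SQL text.
    if isinstance(days, bool) or not isinstance(days, int) or days < 1:
        raise ValueError("days must be a positive integer")
    sql = (
        "SELECT COUNT(*) AS num_downloads\n"
        f"FROM `{TABLE}`\n"
        "WHERE file.project = @project\n"
        "  AND DATE(timestamp)\n"
        f"    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)\n"
        "    AND CURRENT_DATE()"
    )
    return sql, {"project": project}


if __name__ == "__main__":
    sql, params = build_count_query("pytest", days=30)
    print(sql)
    print(params)
```

Keeping the project name out of the SQL string avoids quoting problems with unusual project names; the day count is the only value placed directly into the text.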

To count downloads from pip only, filter on the details.installer.name column.

    #standardSQL
    SELECT COUNT(*) AS num_downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'pytest'
      AND details.installer.name = 'pip'
      -- Only query the last 30 days of history
      AND DATE(timestamp)
        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        AND CURRENT_DATE()

num_downloads

24334215

Package downloads over time

To group by monthly downloads, truncate each download date to its month. The query below does this with DATE_TRUNC applied to DATE(timestamp); TIMESTAMP_TRUNC can be used similarly on the raw timestamp. Filtering on the timestamp column also reduces the cost of the query.

    #standardSQL
    SELECT
      COUNT(*) AS num_downloads,
      DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE
      file.project = 'pytest'
      -- Only query the last 6 months of history
      AND DATE(timestamp)
        BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
        AND CURRENT_DATE()
    GROUP BY `month`
    ORDER BY `month` DESC

num_downloads   month
1956741         2018-01-01
2344692         2017-12-01
1730398         2017-11-01
2047310         2017-10-01
1744443         2017-09-01
1916952         2017-08-01
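
For intuition about what the grouping does, the following sketch reproduces the month truncation in plain Python: each timestamp is mapped to the first day of its month, and downloads are counted per bucket. The timestamps are made-up sample values, and truncate_to_month is a hypothetical stand-in for truncating to MONTH, not the BigQuery function itself.

```python
from collections import Counter
from datetime import date, datetime

# Made-up download timestamps, for illustration only.
sample_timestamps = [
    datetime(2018, 1, 3, 12, 0),
    datetime(2018, 1, 28, 8, 30),
    datetime(2017, 12, 31, 23, 59),
    datetime(2017, 12, 1, 0, 0),
    datetime(2017, 11, 15, 6, 45),
]


def truncate_to_month(ts: datetime) -> date:
    """Map a timestamp to the first day of its month."""
    return ts.date().replace(day=1)


def downloads_per_month(timestamps):
    """Count downloads per month and return (month, count), newest first."""
    counts = Counter(truncate_to_month(ts) for ts in timestamps)
    return sorted(counts.items(), reverse=True)


if __name__ == "__main__":
    for month, num_downloads in downloads_per_month(sample_timestamps):
        print(num_downloads, month.isoformat())
```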

Python versions over time

Extract the Python version from the details.python column. Warning: This query processes over 500 GB of data.

    #standardSQL
    SELECT
      REGEXP_EXTRACT(details.python, r"[0-9]+\.[0-9]+") AS python_version,
      COUNT(*) AS num_downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE
      -- Only query the last 6 months of history
      DATE(timestamp)
        BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
        AND CURRENT_DATE()
    GROUP BY `python_version`
    ORDER BY `num_downloads` DESC

python_version   num_downloads
3.7              18051328726
3.6              9635067203
3.8              7781904681
2.7              6381252241
null             2026630299
3.5              1894153540
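
The regular expression keeps only the major and minor components of the version. The small Python sketch below applies the same pattern with the standard re module rather than BigQuery's REGEXP_EXTRACT (assuming the two engines treat this simple pattern alike). It shows how full versions collapse into the buckets in the table above, and why downloads without a recorded version end up under null. python_minor_version is a hypothetical helper name.

```python
import re
from typing import Optional

# Same pattern as in the query: digits, a dot, digits.
MAJOR_MINOR = re.compile(r"[0-9]+\.[0-9]+")


def python_minor_version(full_version: Optional[str]) -> Optional[str]:
    """Return 'X.Y' from a version such as 'X.Y.Z', or None if absent."""
    if full_version is None:
        return None
    match = MAJOR_MINOR.search(full_version)
    return match.group(0) if match else None


if __name__ == "__main__":
    for raw in ("2.7.12", "3.6.4", None):
        print(repr(raw), "->", repr(python_minor_version(raw)))
```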

Getting absolute links to artifacts

It’s sometimes helpful to be able to get the absolute links to download artifacts from PyPI based on their hashes, e.g. if a particular project or release has been deleted from PyPI. The metadata table includes the path column, which includes the hash and artifact filename.

Note

The URL generated here is not guaranteed to be stable, but currently aligns with the URL where PyPI artifacts are hosted.

    SELECT
      CONCAT('https://files.pythonhosted.org/packages', path) AS url
    FROM `bigquery-public-data.pypi.distribution_metadata`
    WHERE filename LIKE 'sampleproject%'
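
The same concatenation can be done client-side once a path value has been fetched. In this sketch the path is a made-up placeholder (real values contain a hash), artifact_url is a hypothetical helper, and it assumes, as the query's plain concatenation suggests, that path begins with a slash.

```python
BASE_URL = "https://files.pythonhosted.org/packages"


def artifact_url(path: str) -> str:
    """Join the hosting prefix and a path value from the metadata table."""
    # Assumption: path normally starts with "/"; add one if it does not.
    if not path.startswith("/"):
        path = "/" + path
    return BASE_URL + path


if __name__ == "__main__":
    # Placeholder path, not a real artifact.
    print(artifact_url("/aa/bb/cc/sampleproject-0.0.0.tar.gz"))
```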

Caveats

In addition to the caveats listed in the background above, Linehaul suffered from a bug which caused it to significantly under-report download statistics prior to July 26, 2018. Downloads before this date are proportionally accurate (e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are lower than actual by an order of magnitude.
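
When comparing totals across years, it can help to flag periods that fall before the fix. The sketch below is a hypothetical helper that does so. Treating July 26, 2018 itself as the first reliable day is an assumption, since the text only says "prior to" that date.

```python
from datetime import date

# Date taken from the caveat above; totals before it are under-reported.
UNDERCOUNT_FIXED_ON = date(2018, 7, 26)


def totals_are_underreported(day: date) -> bool:
    """Return True if absolute download counts for this day are too low."""
    return day < UNDERCOUNT_FIXED_ON


if __name__ == "__main__":
    for day in (date(2018, 7, 25), date(2018, 7, 26)):
        print(day.isoformat(), totals_are_underreported(day))
```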

Additional tools

Besides using the BigQuery console, there are some additional tools which may be useful when analyzing download statistics.

google-cloud-bigquery

You can also access the public PyPI download statistics dataset programmatically via the BigQuery API and the google-cloud-bigquery project, the official Python client library for BigQuery.

    from google.cloud import bigquery

    # Note: depending on where this code is being run, you may require
    # additional authentication. See:
    # https://cloud.google.com/bigquery/docs/authentication/
    client = bigquery.Client()

    query_job = client.query("""
    SELECT COUNT(*) AS num_downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE file.project = 'pytest'
      -- Only query the last 30 days of history
      AND DATE(timestamp)
        BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        AND CURRENT_DATE()""")

    results = query_job.result()  # Waits for job to complete.
    for row in results:
        print("{} downloads".format(row.num_downloads))

pypinfo

pypinfo is a command-line tool which provides access to the dataset and can generate several useful queries. For example, you can query the total number of downloads for a package with the command pypinfo package_name.

Install pypinfo using pip.

    python3 -m pip install pypinfo

Usage:

    $ pypinfo requests
    Served from cache: False
    Data processed: 6.87 GiB
    Data billed: 6.87 GiB
    Estimated cost: $0.04

    | download_count |
    | -------------- |
    |      9,316,415 |

pandas-gbq

The pandas-gbq project allows for accessing query results via Pandas.

References

[1] PyPI Download Counts deprecation email

[2] PyPI BigQuery dataset announcement email
