Tools to bulk download arxiv data


armancohan/arxiv-tools


Prerequisites

arXiv provides bulk data access through Amazon S3. You need an Amazon AWS account to download the data. You also need Python 2.

Downloading arXiv documents

1- Install s3cmd, a command-line tool for interacting with S3

pip install s3cmd (only works on python 2)

2- Configure your s3cmd by entering credentials found in the account management tab of the Amazon AWS website

s3cmd --configure

3- Get the manifest files:

The complete set of arXiv files is available from Amazon S3 in requester-pays buckets. The files are in .tar format, each about 500 MB in size. To download these chunks you need their keys, and the complete list of keys is provided in the manifest files. First download the manifests:

For PDF documents:

s3cmd get --requester-pays s3://arxiv/pdf/arXiv_pdf_manifest.xml local-directory/arXiv_pdf_manifest.xml

For source documents:

s3cmd get --requester-pays s3://arxiv/src/arXiv_src_manifest.xml local-directory/arXiv_src_manifest.xml
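Once a manifest is downloaded, it can be useful to see which keys it lists and how much data they add up to before starting a requester-pays transfer. The following is a minimal sketch, not part of this repository. It assumes the manifest is XML with one `file` element per chunk, each containing a `filename` (the S3 key) and a `size` in bytes; check these element names against your downloaded manifest. The function name `list_manifest_keys` and the sample entries are hypothetical.

```python
import xml.etree.ElementTree as ET


def list_manifest_keys(manifest_xml):
    """Return (key, size_in_bytes) pairs parsed from manifest XML text.

    Assumes each chunk is a <file> element with <filename> and <size>
    children; adjust the tag names if your manifest differs.
    """
    root = ET.fromstring(manifest_xml)
    entries = []
    for file_el in root.iter("file"):
        key = file_el.findtext("filename")
        if key is None:
            continue  # skip entries without a key
        size_text = file_el.findtext("size")
        size = int(size_text) if size_text else 0
        entries.append((key.strip(), size))
    return entries


# Hypothetical two-entry manifest, for illustration only.
SAMPLE = """<arXivSRC>
  <file><filename>src/arXiv_src_0001_001.tar</filename><size>1000</size></file>
  <file><filename>src/arXiv_src_0001_002.tar</filename><size>2500</size></file>
</arXivSRC>"""

entries = list_manifest_keys(SAMPLE)
total_bytes = sum(size for _, size in entries)
```

To use it on a real manifest, read the file contents (for example `open("local-directory/arXiv_src_manifest.xml").read()`) and pass the text to `list_manifest_keys`.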

4- Download the actual PDF and source files using the download.py script

Download pdf files:

python download.py --manifest_file /path/to/pdf-manifest --mode pdf --output_dir /path/to/output

Download source files:

python download.py --manifest_file /path/to/src-manifest --mode src --output_dir /path/to/output

This will download all the files to the directory that you designated as output.
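For readers who want to fetch a single chunk by hand, the sketch below shows how one key from a manifest could be turned into an `s3cmd get --requester-pays` call of the same form as the manifest commands above. It is an illustration only and is not taken from download.py, whose internals are not described here. The helper name `build_get_command`, the example key, and the default bucket argument are assumptions.

```python
import os


def build_get_command(key, output_dir, bucket="arxiv"):
    """Build the s3cmd argument list that fetches one key into output_dir."""
    # Save the chunk under its own file name inside the output directory.
    dest = os.path.join(output_dir, os.path.basename(key))
    source = "s3://{}/{}".format(bucket, key)
    return ["s3cmd", "get", "--requester-pays", source, dest]


# Hypothetical key, for illustration only.
cmd = build_get_command("pdf/arXiv_pdf_0001_001.tar", "/path/to/output")

# To actually run it (this incurs requester-pays charges):
#   import subprocess
#   subprocess.check_call(cmd)
```

Passing the command as a list rather than a single shell string avoids quoting problems if the output path contains spaces.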

If you also need the metadata, use metha to bulk download it.
