Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is licensed under the Creative Commons Attribution-ShareAlike 4.0 License (CC-BY-SA), and most is additionally licensed under the GNU Free Documentation License (GFDL).[1] Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
There are many ways to read Wikipedia while offline; some of them are mobile applications – see "List of Wikipedia mobile applications".
TL;DR: GET THE MULTISTREAM VERSION! (and the corresponding index file, pages-articles-multistream-index.txt.bz2)
pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same XML contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing. Your reader should handle this for you; if your reader doesn't support it, it will work anyway since multistream and non-multistream contain the same XML. The only downside of multistream is that it is marginally larger. You might be tempted to get the smaller non-multistream archive, but this will be useless if you don't unpack it. And it will unpack to ~5–10 times its original size. Penny wise, pound foolish. Get multistream.
NOTE THAT the multistream dump file contains multiple bz2 'streams' (bz2 header, body, footer) concatenated together into one file, in contrast to the vanilla file which contains one stream. Each separate 'stream' (or really, file) in the multistream dump contains 100 pages, except possibly the last one.
For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.
Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.
See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with Python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy.
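As a rough illustration of the approach described above, here is a minimal Python 3 sketch (file names and the example title are placeholders, and error handling is omitted) that uses the index file and bz2.BZ2Decompressor to pull out the single stream containing a given article:

    import bz2

    INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"
    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    TITLE = "Autism"  # example article title

    # 1. Find the byte offset of the bz2 stream that holds the article.
    offset = None
    with bz2.open(INDEX, "rt", encoding="utf-8") as index:
        for line in index:
            off, page_id, title = line.rstrip("\n").split(":", 2)
            if title == TITLE:
                offset = int(off)
                break
    if offset is None:
        raise SystemExit("title not found in index")

    # 2. Seek to that offset and decompress just that one stream
    #    (each stream holds up to 100 <page> elements).
    with open(DUMP, "rb") as dump:
        dump.seek(offset)
        decomp = bz2.BZ2Decompressor()
        chunks = []
        while not decomp.eof:
            data = dump.read(65536)
            if not data:
                break
            chunks.append(decomp.decompress(data))
    xml = b"".join(chunks).decode("utf-8")

    # 3. 'xml' now holds up to 100 <page>...</page> blocks; search it for TITLE.
    print(xml[:500])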
In the dumps
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors. You should rsync from the mirror, then fill in the missing images from upload.wikimedia.org; when downloading from upload.wikimedia.org you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if it was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate user agent string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the MediaWiki API and verifying them. The API Etiquette page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no maxlag parameter).
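As an illustration of the checksum advice, here is a hedged Python sketch (the file title, endpoint choice and contact address are placeholders, and it assumes the file is hosted on Wikimedia Commons) that compares a local copy against the SHA-1 reported by the MediaWiki API:

    import hashlib
    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    HEADERS = {"User-Agent": "example-mirror-tool/0.1 (contact: you@example.org)"}

    def api_sha1(file_title):
        # Ask the MediaWiki API for the file's SHA-1 checksum.
        params = {
            "action": "query",
            "titles": file_title,  # e.g. "File:Example.jpg"
            "prop": "imageinfo",
            "iiprop": "sha1",
            "format": "json",
        }
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        page = next(iter(data["query"]["pages"].values()))
        return page["imageinfo"][0]["sha1"]

    def local_sha1(path):
        # Hash the downloaded file in chunks so large media need not fit in memory.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Re-download (or flag) any file whose local hash does not match:
    # if api_sha1("File:Example.jpg") != local_sha1("Example.jpg"): ...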
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-4.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal).
The dump files are significantly compressed, and will therefore take up large amounts of drive space once decompressed. A large list of decompression programs is described in Comparison of file archivers. The following programs in particular can be used to decompress bzip2, .bz2, .zip, and .7z files.
Beginning with Windows XP, a basic decompression program enables decompression of zip files.[2][3] Among others, the following can be used to decompress bzip2 files.
As files grow in size, so does the likelihood they will exceed some limit of a computing device. Each operating system, file system, hard storage device, and software (application) has a maximum file size limit. Each one of these will likely have a different maximum, and the lowest limit of all of them will become the file size limit for a storage device.
The older the software in a computing device, the more likely it will have a 2 GB file limit somewhere in the system. This is due to older software using 32-bit integers for file indexing, which limits file sizes to 2^31 bytes (2 GB) for signed integers, or 2^32 bytes (4 GB) for unsigned integers. Older C programming libraries have this 2 or 4 GB limit, but the newer file libraries have been converted to 64-bit integers, thus supporting file sizes up to 2^63 or 2^64 bytes (8 or 16 EB).
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, check the amount of free space to ensure that it can hold the downloaded file, and make sure the device(s) you'll use the storage with are able to read your chosen file system.
There are two limits for a file system: the file size limit and the file system size limit. In general, since the file size limit is less than the file system size limit, the larger file system limits are a moot point. Many users assume they can create files up to the size of their storage device, but this assumption is often wrong. For example, a 16 GB storage device formatted with the FAT32 file system has a file size limit of 4 GB for any single file. The following is a list of the most common file systems; see Comparison of file systems for additional detailed information.
Each operating system has internal limits for file size and drive size, which are independent of the file system or physical media. If the operating system has any limit lower than the file system or physical media, then the OS limit will be the real limit.
Android: Android is based on Linux, which determines its base limits.
It is useful to check the MD5 sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, file sizes may be reported differently on different filesystems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely.
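If md5sum is not available, the same check can be done in Python; the following minimal sketch (the file names are placeholders for whatever you actually downloaded) hashes the dump and compares it against the published checksum file:

    import hashlib

    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    CHECKSUMS = "enwiki-latest-md5sums.txt"

    # Hash the dump in 1 MiB chunks so the whole file never sits in memory.
    h = hashlib.md5()
    with open(DUMP, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)

    # Each line of the checksum file is "<md5>  <filename>".
    with open(CHECKSUMS, encoding="utf-8") as f:
        expected = dict((name, md5) for md5, name in
                        (line.split() for line in f if line.strip()))

    print("OK" if expected.get(DUMP) == h.hexdigest() else "MISMATCH - redownload")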
If you seem to be hitting the 2 GB limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump). Also, you can resume downloads (for example wget -c).
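The resume trick can also be reproduced in Python with an HTTP Range request. This is only a hedged sketch (URL, file name and contact address are placeholders), not a full download client:

    import os
    import requests

    URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2"
    DEST = "enwiki-latest-pages-articles-multistream.xml.bz2"
    HEADERS = {"User-Agent": "example-downloader/0.1 (contact: you@example.org)"}

    # Ask the server to start where the partial local file left off.
    have = os.path.getsize(DEST) if os.path.exists(DEST) else 0
    headers = dict(HEADERS, Range=f"bytes={have}-")

    with requests.get(URL, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        mode = "ab" if r.status_code == 206 else "wb"  # 206 = server honoured the Range header
        with open(DEST, mode) as f:
            for chunk in r.iter_content(1 << 20):
                f.write(chunk)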
Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished HTML.
Also, if you want to get all the data, you'll probably want to transfer it in the most efficient way that's possible. The wikipedia.org servers need to do quite a bit of work to convert the wikicode into HTML. That's time consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.
To access any article in XML, one at a time, access Special:Export/Title of the article.
Read more about this at Special:Export.
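For example, a minimal Python sketch (the article title and contact address are just examples) to fetch one article's XML through Special:Export:

    import requests

    title = "Albert Einstein"  # example article
    url = "https://en.wikipedia.org/wiki/Special:Export/" + title.replace(" ", "_")
    headers = {"User-Agent": "example-offline-tool/0.1 (contact: you@example.org)"}

    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    # The response is XML; the page's wikitext is inside the <text> element.
    print(resp.text[:500])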
Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see Wikipedia:Mirrors and forks.
Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia.
You can do SQL queries on the current database dump using Quarry (as a replacement for the disabled Special:Asksql page).
See also: mw:Manual:Database layout
The SQL file used to initialize a MediaWiki database can be found here.
The XML schema for each dump is defined at the top of the file and described in the MediaWiki export help page.
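To work with the XML directly, one option is to stream it. The following minimal Python sketch (the dump name is a placeholder, and the namespace URI should be checked against the <mediawiki> tag of your own dump, since the export schema version changes over time) walks the dump one <page> at a time without loading it all into memory:

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
    NS = "{http://www.mediawiki.org/xml/export-0.11/}"  # check against your dump's <mediawiki> tag

    with bz2.open(DUMP, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                print(title)
                elem.clear()  # discard processed pages to keep memory use flat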
You can do Hadoop MapReduce queries on the current database dump, but you will need an extension to the InputRecordFormat to have each <page> </page> be a single mapper input. A working set of Java methods (jobControl, mapper, reducer, and XmlInputRecordFormat) is available at Hadoop on the Wikipedia.
See:
Access to recent article update dumps (Snapshot API) or individual article retrieval (On-demand API) is available via Wikimedia Enterprise with a free account (documentation). Alternatively, use your developer account to access the APIs within Wikimedia Cloud Services.
MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.
See also:

Kiwix is by far the largest offline distribution of Wikipedia to date. As an offline reader, Kiwix works with a library of content in ZIM files: you can pick and choose whichever Wikimedia project (Wikipedia in any language, Wiktionary, Wikisource, etc.), as well as TED Talks, PhET Interactive Maths & Physics simulations, Project Gutenberg, etc.
It is free and open source, and currently available for download on:
... as well as extensions for Chrome & Firefox browsers, server solutions, etc. See the official website for the complete Kiwix portfolio.
Aard Dictionary is an offline Wikipedia reader. No images. Cross-platform for Windows, Mac, Linux, Android, Maemo. Runs on rooted Nook and Sony PRS-T1 eBook readers.
It also has a successor, Aard 2.
The wikiviewer plugin for Rockbox permits viewing converted Wikipedia dumps on many Rockbox devices. It needs a custom build and conversion of the wiki dumps using the instructions available at http://www.rockbox.org/tracker/4755. The conversion recompresses the file and splits it into 1 GB files and an index file, which all need to be in the same folder on the device or microSD card.
Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a wiki page is just like browsing a Wiki site, but the content is fetched and converted from a local dump file on request from the browser.
XOWA is a free, open-source application that helps download Wikipedia to a computer. Access all of Wikipedia offline, without an internet connection! It is currently in the beta stage of development, but is functional. It is available for download here.
WikiFilter is a program which allows you to browse over 100 dump files without visiting a Wiki site.
WikiTaxi is an offline-reader for wikis in MediaWiki format. It enables users to search and browse popular wikis like Wikipedia, Wikiquote, or WikiNews, without being connected to the Internet. WikiTaxi works well with different languages like English, German, Turkish, and others but has a problem with right-to-left language scripts. WikiTaxi does not display images.
For WikiTaxi reading, only two files are required: WikiTaxi.exe and the .taxi database. Copy them to any storage device (memory stick or memory card) or burn them to a CD or DVD and take your Wikipedia with you wherever you go!
BzReader is an offline Wikipedia reader with fast search capabilities. It renders the Wiki text into HTML and doesn't need to decompress the database. Requires Microsoft .NET framework 2.0.
MzReader by Mun206 works with (though is not affiliated with) BzReader, and allows further rendering of wikicode into better HTML, including an interpretation of the monobook skin. It aims to make pages more readable. Requires Microsoft Visual Basic 6.0 Runtime, which is not supplied with the download. Also requires Inet Control and Internet Controls (Internet Explorer 6 ActiveX), which are packaged with the download.
An offline Wikipedia database in the EPWING dictionary format, a common though dated Japanese Industrial Standard (JIS) in Japan, can be read (including thumbnail images and tables, with some rendering limits) on any system where a reader is available (Boookends). There are many free and commercial readers for Windows (including Mobile), Mac OS X, iOS (iPhone, iPad), Android, Unix-Linux-BSD, DOS, and Java-based browser applications (EPWING Viewers).
WP-MIRROR is a free utility for mirroring any desired set of WMF wikis. That is, it builds a wiki farm that the user can browse locally. WP-MIRROR builds a complete mirror with original-size media files. WP-MIRROR is available for download.