WikiExtractor

A tool for extracting plain text from Wikipedia dumps


WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database backup dump, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.

The tool is written in Python and requires Python 3; no additional libraries are needed. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation on Windows.

For further information, see the Wiki.

Wikipedia Cirrus Extractor

cirrus-extractor.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text with already expanded templates.

Cirrus dumps are available at: cirrussearch.
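
Since Cirrus dumps already contain expanded text, very little processing is needed to get at it. The following is a rough sketch, not part of this tool: it assumes the dump is a gzipped stream of JSON lines alternating between an index header and a document object with "title" and "text" fields, which is an assumption about the Cirrus format rather than something this README guarantees.

import gzip, json

# Sketch: scan a Cirrus dump, skipping the index header lines and yielding
# the document objects.  Field names ("title", "text") are assumptions.
def iter_cirrus_docs(path):
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            obj = json.loads(line)
            if 'index' in obj:              # header line for the next document
                continue
            yield obj.get('title'), obj.get('text')

for title, text in iter_cirrus_docs('enwiki-latest-cirrussearch-content.json.gz'):
    print(title)
    break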

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

  • multiprocessing is used for dealing with articles in parallel
  • a cache is kept of parsed templates (only useful for repeated extractions).
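
A rough, self-contained sketch of these two ideas follows. It is illustrative only, not the repository's actual code: the toy TEMPLATES table and the simplistic {{Name|arg}} handling are stand-ins for the real preprocessing and expansion logic.

import re
from functools import lru_cache
from multiprocessing import Pool

# Toy template table, as if collected during a preprocessing pass over the dump.
TEMPLATES = {'Lang': 'language: {{{1}}}'}

@lru_cache(maxsize=None)
def parsed_template(name):
    """Fetch and 'parse' a template body once; later uses hit the cache."""
    return TEMPLATES.get(name, '')

def extract(article):
    """Expand {{Name|arg}} calls in one article using the cached templates."""
    def repl(m):
        name, _, arg = m.group(1).partition('|')
        return parsed_template(name.strip()).replace('{{{1}}}', arg)
    return re.sub(r'\{\{([^{}]+)\}\}', repl, article)

if __name__ == '__main__':
    articles = ['Latin {{Lang|la}} text.', 'Greek {{Lang|el}} text.']
    with Pool(2) as pool:
        print(pool.map(extract, articles))   # articles handled in parallel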

Installation

The script may be invoked directly:

python -m wikiextractor.WikiExtractor <Wikipedia dump file>

It can also be installed from PyPI by doing:

pip install wikiextractor

or locally with:

(sudo) python setup.py install

The installer also installs two scripts for direct invocation:

  • wikiextractor (equivalent to python -m wikiextractor.WikiExtractor)
  • extractPage (to extract a single page from a dump)

Usage

Wikiextractor

The script is invoked with a Wikipedia dump file as an argument:

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]

The option --templates extracts the templates to a local file, which can be reloaded to reduce the time to perform extraction.
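
For example (templates.txt is an arbitrary file name, shown only for illustration):

python -m wikiextractor.WikiExtractor --templates templates.txt enwiki-latest-pages-articles.xml.bz2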

The output is stored in several files of similar size in a given directory. Each file will contain several documents in this document format.
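
As a minimal downstream sketch (not part of the tool), the extracted files can be scanned for the <doc ...>...</doc> blocks shown in the usage text below; this assumes the default uncompressed, non-JSON output.

import glob, os, re

# Sketch: yield the attribute string and body of every <doc ...> ... </doc>
# block found under the output directory.
DOC_RE = re.compile(r'<doc ([^>]*)>(.*?)</doc>', re.DOTALL)

def iter_docs(output_dir):
    for path in sorted(glob.glob(os.path.join(output_dir, '**', '*'), recursive=True)):
        if os.path.isfile(path):
            with open(path, encoding='utf-8') as f:
                for attrs, text in DOC_RE.findall(f.read()):
                    yield attrs, text.strip()

for attrs, text in iter_docs('extracted'):   # 'extracted' is a hypothetical -o directory
    print(attrs)
    break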

usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l]
                     [-ns ns1,ns2] [--templates TEMPLATES] [--no-templates]
                     [--html-safe HTML_SAFE] [--processes PROCESSES] [-q]
                     [--debug] [-a] [-v]
                     input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

    <doc url="" title="">
        ...
    </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as json objects, one per line, with
the following structure

    {"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
                        Number of processes to use (default 79)

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default <doc> format

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces
  --templates TEMPLATES
                        use or create file containing templates
  --no-templates        Do not expand templates
  --html-safe HTML_SAFE
                        use to produce HTML safe output within <doc>...</doc>

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug option)
  -v, --version         print program version
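
With --json, each output line is instead a single JSON object with the keys shown above, so reading it back is straightforward (a minimal sketch; the output file path is hypothetical and assumes uncompressed output):

import json

with open('extracted/AA/wiki_00', encoding='utf-8') as f:   # hypothetical output file
    for line in f:
        doc = json.loads(line)       # {"id", "revid", "url", "title", "text"}
        print(doc['title'], len(doc['text']))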

Saving templates to a file will speed up extraction the next time, assuming template definitions have not changed.

The option --no-templates significantly speeds up the extractor by avoiding the cost of expanding MediaWiki templates.

For further information, visit the documentation.

Cirrus Extractor

usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
                         [-v]
                         input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

    <doc url="" title="" language="" revision="">
        ...
    </doc>

positional arguments:
  input                 Cirrus Json wiki dump file

optional arguments:
  -h, --help            show this help message and exit

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip

Processing:
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces

Special:
  -q, --quiet           suppress reporting progress info
  -v, --version         print program version

extractPage

Extract a single page from a Wikipedia dump file.

usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
  input          XML wiki dump file

optional arguments:
  -h, --help     show this help message and exit
  --id ID        article number
  --template     template number
  -v, --version  print program version

License

The code is made available under the GNU Affero General Public License v3.0.

Reference

If you find this code useful, please cite it in publications as:

@misc{Wikiextractor2015,
  author = {Giuseppe Attardi},
  title = {WikiExtractor},
  year = {2015},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/attardi/wikiextractor}}
}
