This help page is a how-to guide. It explains concepts or processes used by the Wikipedia community. It is not one of Wikipedia's policies or guidelines, and may reflect varying levels of consensus.
Wiki pages can be exported in a special XML format to import into another MediaWiki installation or to use in some other way, for instance for analysing the content. See also m:Syndication feeds for exporting all other information except pages, and see Help:Import on importing pages.
There are at least six ways to export pages:
action=raw. (This fetches just the page's wikitext and not the XML format described below.) For example: https://en.wikipedia.org/w/index.php?title=Wikipedia&action=raw. It is important to use /w/index.php?title=PAGENAME&action=raw and not /wiki/PAGENAME?action=raw (see Phab T126183).
By default only the current version of a page is included. Optionally you can get all versions with date, time, user name and edit summary.
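The action=raw URL form described above can be built programmatically. The following is a minimal sketch; the helper name `raw_url` and its `base` parameter are illustrative assumptions, and only the URL shape comes from the text above. It builds the URL and does not fetch anything.

```python
from urllib.parse import urlencode


def raw_url(title: str, base: str = "https://en.wikipedia.org/w/index.php") -> str:
    """Return the action=raw URL for a page title.

    Uses the /w/index.php?title=...&action=raw form, not /wiki/PAGENAME?action=raw.
    `raw_url` and `base` are hypothetical names chosen for this sketch.
    """
    query = urlencode({"title": title, "action": "raw"})
    return f"{base}?{query}"


print(raw_url("Wikipedia"))
```

The resulting string can be passed to any HTTP client to download the wikitext of the current revision.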
Additionally you can copy the SQL database. This is how dumps of the database were made available before MediaWiki 1.5 and it won't be explained here further.
To export all pages of a namespace, for example.
and finally...
Now you can use this XML file to perform an import.
A checkbox in the Special:Export interface selects whether to export the full history (all versions of an article) or only the most recent version of each article. A maximum of 1000 revisions are returned; other revisions can be requested as detailed in MW:Parameters to Special:Export.
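For scripted use, the current revision of a single page can usually be requested by appending the page name to Special:Export. The sketch below only builds such a URL; the helper name `export_url` and the `base` default are assumptions made for illustration, and any further options should be taken from MW:Parameters to Special:Export rather than from this example.

```python
from urllib.parse import quote


def export_url(title: str, base: str = "https://en.wikipedia.org/wiki/") -> str:
    """Return a Special:Export URL for one page (hypothetical helper).

    Spaces become underscores, as in ordinary page URLs; ':' and '/' are
    left unescaped so namespace prefixes and subpages stay readable.
    """
    page = quote(title.replace(" ", "_"), safe=":/")
    return f"{base}Special:Export/{page}"


print(export_url("Main Page"))
```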
The format of the XML file you receive is the same whichever export method you use. This format is codified in XML Schema at http://www.mediawiki.org/xml/export-0.6.xsd. This format is not intended for viewing in a web browser, though some browsers show you pretty-printed XML with "+" and "-" links to view or hide selected parts. Alternatively the XML source can be viewed using the "view source" feature of the browser, or, after saving the XML file locally, with a program of your choice. If you read the XML source directly, it won't be difficult to find the actual wikitext. If you don't use a special XML editor, "<" and ">" appear as &lt; and &gt;, to avoid a conflict with XML tags; to avoid ambiguity, "&" is coded as "&amp;".
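When wikitext is copied out of the raw XML source by hand, the three escaped characters have to be turned back into "<", ">" and "&". A minimal sketch using Python's standard library follows; the sample string is invented for illustration. An XML parser performs this step automatically, so it is only needed when working on the unparsed source.

```python
from xml.sax.saxutils import escape, unescape

# Invented sample of wikitext as it appears inside the raw XML source.
escaped = "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;"

# unescape() converts &lt;, &gt; and &amp; back to the literal characters.
wikitext = unescape(escaped)
print(wikitext)

# escape() is the inverse and is what an exporter would apply.
assert escape(wikitext) == escaped
```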
In the current version the export format does not contain an XML replacement of wiki markup (see Wikipedia DTD for an older proposal, or Wiki Markup Language). You only get the wikitext as you get when editing the article. (After export you can use alternative parsers to convert wikitext to other formats.)
<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <!-- page namespace code -->
    <ns>0</ns>
    <id>2</id>
    <!-- If page is a redirection, element "redirect" contains title of the page redirect to -->
    <redirect title="Redirect page title" />
    <restrictions>edit=sysop:move=sysop</restrictions>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username><id>65536</id></contributor>
      <comment>I have just one thing to say!</comment>
      <text>A bunch of [[text]] here.</text>
      <minor />
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>new!</comment>
      <text>An earlier [[revision]].</text>
    </revision>
    <revision>
      <!-- deleted revision example -->
      <id>4557485</id>
      <parentid>1243372</parentid>
      <timestamp>2010-06-24T02:40:22Z</timestamp>
      <contributor deleted="deleted" />
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text deleted="deleted" />
      <sha1/>
    </revision>
  </page>
  <page>
    <title>Talk:Page title</title>
    <revision>
      <timestamp>2001-01-15T14:03:00Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <comment>hey</comment>
      <text>WHY D YOU LOCK PAGE??!!! i was editing that jerk</text>
    </revision>
  </page>
</mediawiki>
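As an illustration of reading this structure, the sketch below parses a shortened copy of the example and returns the newest wikitext of each page. The names `SAMPLE` and `latest_texts` are hypothetical. It assumes the elements carry no XML namespace, as in the example; real exports usually declare a default `xmlns`, in which case the tag names passed to `findall` would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example export (no xmlns declared).
SAMPLE = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <ns>0</ns>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username><id>65536</id></contributor>
      <text>A bunch of [[text]] here.</text>
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <text>An earlier [[revision]].</text>
    </revision>
  </page>
</mediawiki>"""


def latest_texts(xml_source: str) -> dict:
    """Map each page title to the wikitext of its most recent revision."""
    root = ET.fromstring(xml_source)
    result = {}
    for page in root.findall("page"):
        revisions = page.findall("revision")
        if not revisions:
            continue  # page element without any revision
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        newest = max(revisions, key=lambda r: r.findtext("timestamp"))
        result[page.findtext("title")] = newest.findtext("text")
    return result


print(latest_texts(SAMPLE))
```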
Here is an unofficial, short Document Type Definition version of the format. If you don't know what a DTD is just ignore it.
<!ELEMENT mediawiki (siteinfo?,page*)>
<!-- version contains the version number of the format (currently 0.3) -->
<!ATTLIST mediawiki
  version CDATA #REQUIRED
  xmlns CDATA #FIXED "http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED
    "http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
<!ELEMENT sitename (#PCDATA)>      <!-- name of the wiki -->
<!ELEMENT base (#PCDATA)>          <!-- url of the main page -->
<!ELEMENT generator (#PCDATA)>     <!-- MediaWiki version string -->
<!ELEMENT case (#PCDATA)>          <!-- how cases in page names are handled -->
   <!-- possible values: 'first-letter' | 'case-sensitive'
                         'case-insensitive' option is reserved for future -->
<!ELEMENT namespaces (namespace+)> <!-- list of namespaces and prefixes -->
  <!ELEMENT namespace (#PCDATA)>     <!-- contains namespace prefix -->
  <!ATTLIST namespace key CDATA #REQUIRED> <!-- internal namespace number -->
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
  <!ELEMENT title (#PCDATA)>         <!-- Title with namespace prefix -->
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT restrictions (#PCDATA)>  <!-- optional page restrictions -->
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment,text)>
  <!ELEMENT timestamp (#PCDATA)>     <!-- according to ISO8601 -->
  <!ELEMENT minor EMPTY>             <!-- minor flag -->
  <!ELEMENT comment (#PCDATA)>
  <!ELEMENT text (#PCDATA)>          <!-- Wikisyntax -->
  <!ATTLIST text xml:space CDATA #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
  <!ELEMENT username (#PCDATA)>
  <!ELEMENT ip (#PCDATA)>
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
  <!ELEMENT filename (#PCDATA)>
  <!ELEMENT src (#PCDATA)>
  <!ELEMENT size (#PCDATA)>
Many tools can process the exported XML. If you process a large number of pages (for instance a whole dump) you probably won't be able to fit the document in main memory, so you will need a parser based on SAX or other event-driven methods.
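A minimal sketch of the event-driven approach, using Python's built-in `xml.sax` module, is shown here. The handler collects page titles while holding only the current title in memory. The class name `TitleCollector`, the function `stream_titles` and the tiny in-memory document are invented for illustration; with a real dump, a file name or open file object would be passed instead.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def stream_titles(source):
    """Parse a file name or file object and return all page titles."""
    handler = TitleCollector()
    xml.sax.parse(source, handler)
    return handler.titles


demo = io.BytesIO(
    b"<mediawiki><page><title>A</title></page>"
    b"<page><title>Talk:A</title></page></mediawiki>"
)
print(stream_titles(demo))
```

For a whole dump, the handler would typically write each result out immediately rather than collect everything in a list.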
You can also use regular expressions to directly process parts of the XML code. These run fast but are difficult to maintain.
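As a sketch of the regular-expression approach, the following pulls page titles straight out of the XML text without parsing it. The pattern and the name `titles_by_regex` are illustrative assumptions; the pattern relies on `<title>` elements having no attributes and would break if the format changed, which is the maintenance cost mentioned above.

```python
import re
from xml.sax.saxutils import unescape

# Non-greedy match of everything between <title> and </title>.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def titles_by_regex(xml_text: str) -> list:
    """Return all page titles found in the raw XML text.

    The regex sees escaped text, so &lt;, &gt; and &amp; are converted back.
    """
    return [unescape(match) for match in TITLE_RE.findall(xml_text)]


sample = "<page><title>Tom &amp; Jerry</title></page><page><title>B</title></page>"
print(titles_by_regex(sample))
```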
Please list methods and tools for processing XML export here:
/mediawiki/siteinfo/namespaces/namespace
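The path above selects the namespace entries in the site information. A minimal sketch with Python's `xml.etree.ElementTree` follows; the sample data and the name `namespace_map` are invented for illustration, and the sample omits the other `siteinfo` children as well as the `xmlns` declaration that real exports usually carry. Because ElementTree paths are relative to the root element, the leading `/mediawiki` step is replaced by `.`.

```python
import xml.etree.ElementTree as ET

# Invented, abbreviated siteinfo block for demonstration only.
SITEINFO = """<mediawiki>
  <siteinfo>
    <namespaces>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
    </namespaces>
  </siteinfo>
</mediawiki>"""


def namespace_map(xml_source: str) -> dict:
    """Map each namespace number (the key attribute) to its prefix."""
    root = ET.fromstring(xml_source)  # root is the <mediawiki> element
    return {
        int(ns.get("key")): (ns.text or "")  # main namespace has no prefix
        for ns in root.findall("./siteinfo/namespaces/namespace")
    }


print(namespace_map(SITEINFO))
```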