Manual:Importing XML dumps

This page describes methods of importing XML dumps. XML dumps contain the content of a wiki (wiki pages with all their revisions), without the site-related data. An XML dump does not create a full backup of the wiki database; the dump does not contain user accounts, images, edit logs, etc.

The Special:Export page of any MediaWiki site, including every Wikimedia site and Wikipedia, creates an XML file (content dump). See Data dumps on Meta and Manual:DumpBackup.php. There is more detail about the XML export format at Help:Export#Export format.
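
If you have shell access to your own wiki, such a dump can also be produced from the command line with dumpBackup.php. A minimal sketch (the output file name is just an example):

php maintenance/dumpBackup.php --full > dump.xml
# use --current instead of --full to export only the latest revision of each page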

How to import

There are several methods for importing these XML dumps.

Using Special:Import

Special:Import can be used by wiki users with the import permission (by default, users in the sysop group) to import a small number of pages (about 100 should be safe).

"حاول واردات مخزن های بزرگ به این ترتیب ممکن است منجر به توقف زمان یا شکست اتصال شود".

  • See Help:Import for a basic description of how the importing process works.[1]

You are asked to give an interwiki prefix. For instance, if you exported from the English Wikipedia, you have to type 'en'.

XML importing requires the import and importupload permissions. See Manual:User rights.

Large XML uploads

Large XML uploads might be rejected because they exceed the PHP upload limit set in the php.ini file.

; Maximum allowed size for uploaded files.
upload_max_filesize = 20M

Try changing these four settings in php.ini:

; Maximum size of POST data that PHP will accept.
post_max_size = 20M

; Maximum execution time of each script, in seconds
max_execution_time = 1000

; Maximum amount of time each script may spend parsing request data
max_input_time = 2000

; Default timeout for socket based streams (seconds)
default_socket_timeout = 2000
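
To check which php.ini the command-line PHP loads and what the current values are, something like the following can help (a sketch; note that the web server may use a different php.ini than the CLI and must be restarted after changes):

php --ini
php -i | grep -E 'upload_max_filesize|post_max_size|max_execution_time'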

Using importDump.php, if you have shell access

Recommended method for general use, but slow for very big data sets.
See Manual:importDump.php, including tips on how to use it for large wikis.
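
As a rough sketch of a typical run (file names and paths are placeholders; see Manual:importDump.php for the options available in your MediaWiki version):

cd /path/to/wiki
php maintenance/importDump.php --conf LocalSettings.php /path/to/dump.xml.gz
php maintenance/rebuildrecentchanges.php

Running rebuildrecentchanges.php afterwards updates the recent-changes tables so that the imported revisions show up correctly.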

Using the importTextFiles.php maintenance script

MediaWiki version: 1.23
MediaWiki version: 1.27

If you have a lot of content converted from another source (several word processor files, content from another wiki, etc.), you may have several files that you would like to import into your wiki. In MediaWiki 1.27 and later, you can use the importTextFiles.php maintenance script.

You can also use the edit.php maintenance script for this purpose.
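
A hedged sketch of such an invocation; the user name, summary, and file names here are made up, and option names can vary between MediaWiki versions, so check the script's --help output first:

php maintenance/importTextFiles.php --user Admin --summary "Import converted pages" --overwrite page1.txt page2.txt

By default, the title of each imported page is derived from the corresponding file name.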

rebuildall.php

For large XML dumps, you can run rebuildall.php, but it will take a long time, because it has to parse all pages. This is not recommended for large data sets.
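
If you do decide to run it, the invocation itself is simple (from the wiki's installation directory):

php maintenance/rebuildall.php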

Using pywikibot, pagefromfile.py and Nokogiri

pywikibot is a collection of tools written in Python that automate work on Wikipedia and other MediaWiki sites. Once installed on your computer, you can use the tool pagefromfile.py, which lets you upload a wiki file to Wikipedia or other MediaWiki sites. The XML file created by dumpBackup.php can be transformed into a wiki file suitable for processing by 'pagefromfile.py' using a simple script similar to the following (here the program transforms all XML files in the current directory, which is useful if your MediaWiki site is a family):

On Python

from pathlib import Path
from lxml import etree
import re
from pywikibot import Link, Site

"""
The exported XML preserves local namespace names. To import into a project in
another language, you need to convert them to canonical names; otherwise the
pages will be imported as-is, without their namespace being recognized.
The XML contains namespace numbers, but mapping them can be problematic because
they can differ between projects. For example, in ru.wikisource the Page
namespace is number 104, while in uk.wikisource it is 252.
To solve this, the Link() class from pywikibot is used. Link() requires the
language code and family of the project. The XML does not contain these as
separate values; instead they appear in another format, like 'enwikisource' or
'commonswiki', which needs to be parsed.
"""

xml_folder = 'PATH_TO_XML_DUMP'

for xml_file in Path(xml_folder).glob('*.xml'):
    root = etree.parse(str(xml_file)).getroot()
    ns = {'': root.nsmap[None]}  # XML namespace of the export schema

    # Parse the language code and family of the project from the dbname
    dbname = root.find('.//siteinfo/dbname', ns).text
    m = re.search(r'^(\w+?)(wiki|wikisource|wikiquote|wikinews|wikibooks)$', dbname)
    if m:
        site = Site(code=m.group(1), fam=m.group(2))
    elif 'commons' in dbname:
        site = Site('commons')
    elif 'mediawikiwiki' in dbname:
        site = Site('mediawiki')
    else:
        print('Site was not recognized.')
        break

    wiki = []
    for page in root.findall('.//page', ns):
        title_source = page.find('title', ns).text
        title = Link(title_source, site).ns_title().replace(' ', '_')
        revision_id = page.find('.//revision/id', ns).text
        print(f'{title}, revision id: {revision_id}')
        text = page.find('.//revision/text', ns).text
        wiki.append("{{-start-}}\n'''%s'''\n%s\n{{-stop-}}" % (title, text))

    Path(f'out_{xml_file.stem}.wiki').write_text('\n'.join(wiki))

On Ruby

# -*- coding: utf-8 -*-
# dumpxml2wiki.rb

require 'rubygems'
require 'nokogiri'

# This program dumpxml2wiki reads MediaWiki xml files dumped by dumpBackup.php
# in the current directory and transforms them into wiki files which can then be
# modified and uploaded again by pywikipediabot using pagefromfile.py on a
# MediaWiki family site.
# The text of each page is searched with xpath and its title is added on the
# first line as an html comment: this is required by pagefromfile.py.

Dir.glob("*.xml").each do |filename|
  input = Nokogiri::XML(File.new(filename), nil, 'UTF-8')
  puts filename.to_s  # prints the name of each .xml file

  File.open("out_" + filename + ".wiki", 'w') { |f|
    input.xpath("//xmlns:text").each { |n|
      pptitle = n.parent.parent.at_css "title"  # searching for the title
      title = pptitle.content
      f.puts "\n{{-start-}}<!--'''" << title.to_s << "'''-->" << n.content << "\n{{-stop-}}"
    }
  }
end

For example, here is an excerpt of a wiki file output by the command 'ruby dumpxml2wiki.rb' (two pages can then be uploaded by pagefromfile.py, a Template and a second page which is a redirect):

{{-start-}}<!--'''Template:Lang_translation_-pl'''-->
<includeonly>Tłumaczenie</includeonly>
{{-stop-}}

{{-start-}}#REDIRECT [[badania demograficzne]]<!--'''ilościowa demografia'''-->
<noinclude>
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie)|ilościowa demografia]]
[[Category:Termin wielojęzycznego słownika demograficznego (pierwsze wydanie) (redirect)]]
[[Category:10]]
</noinclude>
{{-stop-}}

The program accesses each XML file, extracts the text within the <text> </text> markup of each page, looks up the corresponding title in a parent node, and encloses it in the paired {{-start-}}<!--'''Title of the page'''--> {{-stop-}} commands used by 'pagefromfile' to create or update a page. The name of the page is placed in an HTML comment, wrapped in three quotes, on the same first line. Note that the name of the page can be written in Unicode. Sometimes it is important that the page starts directly with a command, as for a #REDIRECT; in that case the comment giving the name of the page must be after the command, but still on the first line.

Note that the XML dump files produced by dumpBackup.php are prefixed by a namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">

In order to access the text node using Nokogiri, you need to prefix your path with 'xmlns':

input.xpath("//xmlns:text")

Nokogiri is a Ruby library for parsing HTML and XML (with SAX and Reader interfaces) that can search documents via XPath or CSS3 selectors.

Example of the use of 'pagefromfile' to upload the output wiki text file:

python pagefromfile.py -file:out_filename.wiki -summary:"Reason for changes" -lang:pl -putthrottle:01


How to import logs?

Exporting and importing logs with the standard MediaWiki scripts often proves very hard; an alternative for import is the script pages_logging.py in the WikiDAT tool, as suggested by Felipe Ortega.

Troubleshooting

Merging histories, revision conflict, edit summaries, and other complications

When importing a page with history information, if the user name already exists in the target project, change the user name in the XML file to avoid confusion. If the user name is new to the target project, the contributions will still be available, but no account will be created automatically.

When a page is referenced through a link or a URL, generic namespace names are converted automatically. If the prefix is not a recognized namespace name, the page will default to the main namespace. However, prefixes like "mw:" might be ignored on projects that use them for interwiki linking. In such cases, it might be beneficial to change the prefix to "Project:" in the XML file before importing.

If a page with the same name already exists, importing revisions will combine their histories. Adding a revision between two existing ones can make the next user's changes appear different. To see the actual change, compare the two original revisions, not the inserted one. So revisions should only be inserted in order to properly rebuild the page history.

A revision won't be imported if there is already a revision with the same date and time. This usually happens if it has been imported before, either to the current wiki or to a previous one, or both, possibly from a third site.

An edit summary might reference or link to another page, which can be confusing if the page being linked to has not been imported, even though the referencing page has been.

The edit summary doesn't automatically indicate whether the page has been imported, but in the case of an upload import, you can add this information to the edit summaries in the XML file before the import. This can help prevent confusion. When modifying the XML file with find/replace, keep in mind that adding text to the edit summaries requires differentiating between edits with and without an edit summary. If there are multiple comment tags in the XML file, only the last set will be applied.

Interwikis

If you get the message interwiki, the problem is that some pages to be imported have a prefix that is used for interwiki linking. For example, ones with a prefix of 'Meta:' would conflict with the interwiki prefix meta:, which by default links to https://meta.wikimedia.org.

You can do any of the following.

  • Remove the prefix from the interwiki table. This will preserve page titles, but prevent interwiki linking through that prefix.
    Example: you will preserve page titles 'Meta:Blah blah' but will not be able to use the prefix 'meta:' to link to meta.wikimedia.org (although it will be possible through a different prefix).
    How to do it: before importing the dump, run the query DELETE FROM interwiki WHERE iw_prefix='prefix' (note: do not include the colon in the prefix). Alternatively, if you have enabled editing the interwiki table, you can simply go to Special:Interwiki and click the 'Delete' link on the right side of the row belonging to that prefix.
  • Replace the unwanted prefix in the XML file with "Project:" before importing. This will preserve the functionality of the prefix as an interwiki link, but will replace the prefix in the page titles with the name of the wiki they are imported into, and might be quite a pain to do on large dumps (see the sketch after this list).
    Example: replace all 'Meta:' with 'Project:' in the XML file. MediaWiki will then replace 'Project:' with the name of your wiki during importing.
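
A minimal sketch of the second approach with GNU sed, assuming the dump is in a file called dump.xml and that titles appear in it as <title>Meta:...</title> elements (make a backup copy before editing the file in place):

sed -i 's/<title>Meta:/<title>Project:/g' dump.xml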

See also

References

  1. For sample C# code that manipulates an XML import file, see Manual:XML Import file manipulation in CSharp.