Research:Data

From Meta, a Wikimedia project coordination wiki
 

There is a great deal of publicly available, openly licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the Analytics mailing list (more information). You can also find a guide to basic concepts for researchers working with Wikimedia data on the data introduction page.

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.

See also inspirational example uses.

Also consider searching for datasets at Zenodo, Figshare, Dimensions.ai, Google Dataset Search, Academic Torrents, DataHub (historical), or Hugging Face (see also a curated "Wikimedia Datasets" list on Hugging Face).

Quick glance

By access method

Data Dumps (details)

Homepage · Download

Dumps of all WMF projects for backup, offline use, research, etc.

  • Wiki content, revisions, metadata, and page-to-page and outside links
  • XML and SQL format
  • Updated once or twice a month
  • Large file sizes
  • The dumps.wikimedia.org domain also hosts other data, including MediaWiki history dumps, a historical record of revision (without text), user, and page events.
APIs (details)
  • The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases over the web.
    • Meta info about the wiki and the logged-in user, properties of pages (revisions, content, etc.), and lists of pages based on criteria
    • JSON, XML, and PHP's native serialization format
Wiki Replicas (details)

Data Services allows Wikimedia Cloud Services users to query a sanitized copy of the Wikimedia MediaWiki databases.

  • Toolforge and Cloud VPS hosting environments include access to the Wiki Replicas.
  • PAWS is a Jupyter Notebook environment that allows, for example, querying the Wiki Replicas and APIs for analysis.
  • Quarry and Superset are public web interfaces for SQL queries to the Wiki Replicas.
Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using Server-Sent Events over HTTP.

Analytics Dumps (details)

Homepage

Raw pageviews, unique device estimates, mediacounts, etc.

WikiStats (details)

Homepage

Reports based on data dumps and server log files.

  • Unique visits, page views, active editors and more
  • Intermediate CSV files available
  • Graphical presentation
DBpedia (details)

DBpedia extracts structured data from Wikipedia. It allows users to run complex queries and link Wikipedia data to other data sets.

  • RDF, N-Triples, SPARQL endpoint, Linked Data
  • Billions of triples of information in a consistent ontology
DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

Differential privacy (details)

Differential privacy homepage

A collection of differentially-private datasets, released daily, weekly, or monthly.

  • pageview data
  • editor/edit data
  • centralnotice data
  • search data
Training data for AI/ML models (details)

Machine learning models homepage

A collection of AI/ML models in production that power user-facing tools and features across Wikimedia projects. Each model has a corresponding model card, which includes a variety of information about the model, including its training datasets.

By data domain

The table below is a quick reference of data sources organized by data domain. For a more detailed overview of Wikimedia data domains and how to access data in each domain, use the links in the table or see Research:Data introduction.

Data domain | Data source | Access method
Content | MediaWiki REST API | API
Content | MediaWiki Action API: Parse (HTML) | API
Content | MediaWiki Action API: Revisions (wikitext) | API
Content | Wikidata:REST_API | API
Content | Wikimedia Enterprise APIs (require separate accounts; free access may have limits) | API
Content – structured data | Wikidata:REST_API | API
Content – structured data | Wikidata SPARQL query service | API
Content – structured data | Commons SPARQL query service | API
Content – structured data | DBpedia SPARQL endpoint | API
Contributions / edits | MediaWiki Action API: Revisions | API
Contributions / edits | MediaWiki Action API: Allrevisions | API
Contributions / edits | Wikimedia Analytics API: Edits data | API
Contributions / edits | MediaWiki Event Streams | API
Contributions / edits | Wikimedia Enterprise APIs (require separate accounts; free access may have limits) | API
Contributors / editors | Wikimedia Analytics API: Editors by country | API
Contributors / editors | MediaWiki Action API: Users | API
Contributors / editors | MediaWiki Action API: Usercontribs | API
Traffic | Wikimedia Analytics API: Pageviews | API
Traffic | Wikimedia Analytics API: Unique devices | API
Traffic | Wikimedia Analytics API: Mediarequests | API
Contributions / edits | Wikistats | Dashboard
Contributions / edits | XTools | Dashboard
Contributions / edits | Bitergia: technical community metrics | Dashboard
Contributors / editors | Wikistats | Dashboard
Contributors / editors | XTools | Dashboard
Contributors / editors | Bitergia: technical community metrics | Dashboard
Traffic | Devices | Dashboard
Traffic | Wikistats | Dashboard
Traffic | Readers: Pageviews and Unique Devices | Dashboard
Traffic | Pageviews Tool | Dashboard
Traffic | WikiNav | Dashboard
Content | Wikitext | Download
Content | Static HTML and Enterprise HTML (use mwparserfromhtml) | Download
Content | Knowledge gaps | Download
Content – structured data | Commons image depicts | Download
Content – structured data | Wikidata dumps (JSON, RDF, XML) | Download
Content – structured data | DBpedia.org | Download
Contributions / edits | Mediawiki_history | Download
Contributions / edits | geoeditors | Download
Contributions / edits | Differential privacy: Geoeditors | Download
Traffic | Clickstream | Download
Traffic | Pageview hourly | Download
Traffic | Unique devices | Download
Traffic | Mediacounts | Download
Traffic | Differential privacy pageviews | Download
Content | Text | MediaWiki database tables
Contributions / edits | Revision_table | MediaWiki database tables
Contributors / editors | Mediawiki_history | MediaWiki database tables
Contributors / editors | geoeditors | MediaWiki database tables
Contributors / editors | Differential privacy: Geoeditors | MediaWiki database tables
Contributors / editors | actor | MediaWiki database tables
Contributors / editors | user | MediaWiki database tables
Contributors / editors | user_groups | MediaWiki database tables
Contributors / editors | user_former_groups | MediaWiki database tables
Contributors / editors | user_properties | MediaWiki database tables
Contributors / editors | globaluser | MediaWiki database tables

Data dumps

WMF releases data dumps of Wikipedia, Wikidata, and all WMF projects on a regular basis, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.

Content

XML/SQL dumps

  • Text of current and/or all revisions of all pages, in XML format (schema)
  • Metadata for current and/or all revisions of all pages, in XML format (schema)
  • Most database tables as SQL files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, and uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Stub-prefixed dumps for some projects, which contain only header info for pages and revisions, without the actual content

See a more comprehensive list of what is available for download.

Other dumps

Dumps.wikimedia.org offers various other database dumps and datasets.

Download

You can download the latest dumps for the last year (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.). Download mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.

There are also archives. Many older dumps can also be found at the Internet Archive.

Data format

XML dumps are in the wrapper format described at Export format (schema). Files are compressed in gzip (.gz), bzip2/lbzip2 (.bz2), and .7z formats.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See examples of importing dumps into a MySQL database, with step-by-step instructions.
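
For lighter-weight exploration, the compressed XML dumps can also be streamed directly without a database import. Below is a minimal sketch, assuming Python with only the standard library; the dump filename is an example, and the tag matching is deliberately loose because the export schema's XML namespace varies between dump versions.

    import bz2
    import xml.etree.ElementTree as ET

    # Example filename; substitute any pages-articles dump.
    DUMP = "enwiki-latest-pages-articles.xml.bz2"

    # Stream the dump so multi-gigabyte files never have to fit in memory.
    with bz2.open(DUMP, "rb") as f:
        for _, elem in ET.iterparse(f):
            # Match on the local tag name to stay independent of the
            # schema-version-specific namespace URI.
            if elem.tag.endswith("}title") or elem.tag == "title":
                print(elem.text)
            elem.clear()  # free processed elements as we go

Clearing each element after use keeps memory flat even on the largest wikis.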

Existing tools

Some tools are listed on the following pages, but these are mostly outdated and non-functional.

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support


MediaWiki API

The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content

Endpoint

To query the database, you send an HTTP GET request to the desired endpoint (for example, https://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to query and defining the query details in the URL.

How to and examples
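
As a minimal sketch, assuming Python with the requests library (the page title and User-Agent string are illustrative), the following fetches basic page metadata with action=query:

    import requests

    API = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint

    params = {
        "action": "query",            # the query module
        "titles": "Albert Einstein",  # example page title
        "prop": "info",               # basic page properties
        "format": "json",
    }
    headers = {"User-Agent": "research-example/0.1 (example@example.org)"}

    resp = requests.get(API, params=params, headers=headers)
    resp.raise_for_status()

    # Results are keyed by page ID.
    for page in resp.json()["query"]["pages"].values():
        print(page["title"], "last touched:", page["touched"])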

Existing tools

To try out the API interactively on English Wikipedia, use the API Sandbox.

Access

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be granted special access rights on a case-by-case basis.

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Support

Wiki Replicas

The Wiki Replicas (part of the WMCS Data Services; see wikitech:Portal:Data Services) host sanitized versions of the Wikimedia production MediaWiki databases.

Content

Users of various Wikimedia Cloud Services products can access the Wiki Replicas databases, which host sanitized copies of the databases of all Wikimedia projects, including Commons.

Data format

Explore the database schema of the MediaWiki software.

How to

See the Wiki Replicas page on Wikitech for how to access the Wiki Replicas.
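
As a rough sketch of what a query from Toolforge might look like, assuming Python with the PyMySQL package: the host and database naming pattern (<wiki>.analytics.db.svc.wikimedia.cloud, <wiki>_p) and the replica.my.cnf credentials file follow the Wikitech documentation, and the query itself is illustrative.

    import os
    import pymysql

    # Connect to the sanitized English Wikipedia replica from Toolforge.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.wikimedia.cloud",
        database="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
    )

    with conn.cursor() as cur:
        # Five most recently touched articles (namespace 0 = main namespace).
        cur.execute(
            "SELECT page_title FROM page "
            "WHERE page_namespace = 0 "
            "ORDER BY page_touched DESC LIMIT 5"
        )
        for (title,) in cur.fetchall():
            print(title.decode("utf-8"))  # titles are stored as binary strings

    conn.close()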

Support

See wikitech:Help:Cloud Services introduction#Communication and support

Recent changes stream

See EventStreams to subscribe to Recent changes on all Wikimedia wikis. This broadcasts edits and other changes as they happen.
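
A minimal sketch of a consumer, assuming Python with the requests library; the URL is the recentchange stream documented on Wikitech, and the fields printed are illustrative:

    import json
    import requests

    # Server-Sent Events stream of recent changes on all Wikimedia wikis.
    URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    headers = {"User-Agent": "research-example/0.1 (example@example.org)"}

    with requests.get(URL, headers=headers, stream=True) as resp:
        for line in resp.iter_lines():
            # SSE data lines carry the JSON payload; ids and keep-alives
            # are skipped here.
            if line.startswith(b"data: "):
                change = json.loads(line[len(b"data: "):])
                print(change["wiki"], change["type"], change["title"])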

Existing tools

See wikitech:Event Platform/EventStreams/Powered By

Analytics Datasets

Analytics Datasets on dumps.wikimedia.org offers stable, continuously published datasets about web request statistics (including page views, mediacounts, and unique devices), page revision history, data by country, and Wikidata QRanks.

Pageview statistics

Pageview statistics are one example. Each request for a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the requested page are logged and aggregated hourly.

Files starting with "project" contain total hits per project per hour.

Per-country pageviews data is also available, sanitized for privacy reasons. See this announcement post (June 2023).

See the README for details on the format.

You can interactively browse the page view statistics at https://pageviews.toolforge.org. More documentation on the Pageviews Analysis tool is available.
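
The same counts are also exposed programmatically through the Wikimedia Analytics (REST) API. A minimal sketch, assuming Python with the requests library; the article, date range, and User-Agent are illustrative:

    import requests

    # Daily views by human users for one article in January 2024.
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia/all-access/user/Albert_Einstein/daily/20240101/20240131"
    )
    headers = {"User-Agent": "research-example/0.1 (example@example.org)"}

    resp = requests.get(url, headers=headers)
    resp.raise_for_status()

    for item in resp.json()["items"]:
        print(item["timestamp"], item["views"])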

Clickstream data

The Wikipedia clickstream dataset contains counts of (referrer, resource) pairs extracted from Wikipedia's request logs.
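
Each monthly release is a tab-separated file with one (prev, curr, type, n) row per pair, where n is the occurrence count. A minimal sketch for scanning one file, assuming Python with the standard library; the filename and target article are examples:

    import csv
    import gzip

    # Example filename; monthly files follow the clickstream-<wiki>-<YYYY-MM>
    # naming pattern on dumps.wikimedia.org.
    FILE = "clickstream-enwiki-2024-01.tsv.gz"

    with gzip.open(FILE, "rt", encoding="utf-8") as f:
        # Columns: prev (referrer), curr (resource), type, n (count).
        for prev, curr, link_type, n in csv.reader(f, delimiter="\t"):
            if curr == "Albert_Einstein":
                print(prev, link_type, n)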

Geoeditors

The public "Geoeditors" dataset contains information about the monthly number of active editors from a particular country on a particular Wikipedia language edition (bucketed and redacted for privacy reasons). For some earlier years, similar data is available at [1]/[2]; see also Edits by project and country of origin.

Misc datasets

Additional datasets (mostly irregular or discontinued ones) are published at https://analytics.wikimedia.org/datasets/. These include Caching research data and the AS Performance Report.

WikiStats

Wikistats is an informal but widely recognized name for a set of reports which provide monthly trend information for all Wikimedia projects and wikis.

Content

Many dashboards display trends about reading, contributing, and content, broken down by project, such as:

  • unique visitors
  • page views (overall and mobile only)
  • editor activity
  • article count

Data format

Data is presented as charts with the option to download the underlying data.

Support

For more details on Wikistats, see wikitech:Data Platform/Systems/Wikistats 2.

DBpedia

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.

Content

The English version of the DBpedia knowledge base describes millions of things, and the majority of items are classified in a consistent ontology (persons, places, creative works like music albums, films, and video games, organizations like companies and educational institutions, species, diseases, etc.). Localized versions of DBpedia in more than a hundred languages describe millions of things.

The data set also features:

  • about 2 billion pieces of information (RDF triples)
  • labels and abstracts for >10 million unique things in up to 111 different languages
  • millions of links to images, links to external web pages, data links into external RDF datasets, links to Wikipedia categories, and YAGO categories
  • https://www.dbpedia.org/resources/ offers download links for all the datasets, in different formats and languages

Data format

  • RDF/XML
  • Turtle
  • N-Triples
  • SPARQL endpoint
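
As a minimal sketch of using the public SPARQL endpoint, assuming Python with the requests library; the query (five scientists born in Germany) is illustrative, and dbo:/dbr: are prefixes predefined by the endpoint:

    import requests

    ENDPOINT = "https://dbpedia.org/sparql"

    query = """
    SELECT ?person WHERE {
      ?person a dbo:Scientist ;
              dbo:birthPlace dbr:Germany .
    } LIMIT 5
    """

    resp = requests.get(
        ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
    )
    resp.raise_for_status()

    for row in resp.json()["results"]["bindings"]:
        print(row["person"]["value"])  # resource URIs, e.g. .../resource/...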

Access

License

Support

DataHub

The Wikimedia organization on the Open Knowledge Foundation's DataHub was established by the Wikimedia Foundation around 2013 and contains a collection of datasets about Wikipedia and other projects, mostly dating from around 2013–2016.

Wikivoyage also maintains data on its own DataHub:

  • Hotels/restaurants/attractions data as CSV/OSM/OBF
  • Tourism guide for offline use

Differential privacy

The WMF privacy engineering team uses differential privacy to release data that would otherwise be too sensitive to release. This data currently only includes pageview statistics; in the future, it will include statistics about editors, centralnotice impressions and views, search, and more.

Content

Data format

Differentially-private data is currently available in static TSV form at https://analytics.wikimedia.org/published/datasets/. Work to make this data available via API is ongoing.

License

Differentially-private data and code are available under a Creative Commons Zero license.

Support
