A Russian data set for question answering over Wikidata
vladislavneon/RuBQ
We present RuBQ (pronounced [`rubik]) -- Russian Knowledge Base Questions, a KBQA dataset that consists of 1,500 Russian questions of varying complexity along with their English machine translations, corresponding SPARQL queries, and answers, as well as a subset of Wikidata covering entities with Russian labels. To the best of our knowledge, this is the first Russian KBQA and semantic parsing dataset. The dataset is intended to be used as development and test sets in cross-lingual transfer, few-shot learning, or learning-with-synthetic-data scenarios.
Data set files are presented in JSON format as an array of dictionary entries. See full specifications here.
| Question | Query | Answers | Tags |
|---|---|---|---|
| Rus: Кто написал роман «Хижина дяди Тома»? Eng: Who wrote the novel "Uncle Tom's Cabin"? | SELECT ?answer | wd:Q102513 (Harriet Beecher Stowe) | 1-hop |
| Rus: Кто сыграл князя Андрея Болконского в фильме С. Ф. Бондарчука «Война и мир»? Eng: Who played Prince Andrei Bolkonsky in S. F. Bondarchuk's film "War and peace"? | SELECT ?answer | wd:Q312483 (Vyacheslav Tikhonov) | qualifier-constraint |
| Rus: Кто на работе пользуется теодолитом? Eng: Who uses a theodolite for work? | SELECT ?answer | wd:Q1734662 (cartographer) wd:Q11699606 (geodesist) wd:Q294126 (land surveyor) | multi-hop |
| Rus: Какой океан самый маленький? Eng: Which ocean is the smallest? | SELECT ?answer | wd:Q788 (Arctic Ocean) | multi-constraint reverse ranking |
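Because each data set file is a JSON array of dictionaries, it can be read with the standard library alone. The following is a minimal sketch, not part of the repository: the helper names `load_entries` and `count_tags` are hypothetical, and the key name `tags` is an assumption that should be checked against the full specification.

```python
import json
from collections import Counter


def load_entries(path):
    """Load a file that holds a JSON array of dictionary entries."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of entries")
    return entries


def count_tags(entries, tag_field="tags"):
    """Count how often each tag occurs.

    `tag_field` is an assumed key name; entries without it are skipped.
    """
    counts = Counter()
    for entry in entries:
        counts.update(entry.get(tag_field, []))
    return counts
```

Counting tags this way gives a quick overview of how many questions fall into each category (for example `1-hop` or `multi-hop`) before running any experiments.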
We provide a Wikidata sample containing all the entities with Russian labels. It consists of about 212M triples with 8.1M unique entities. This snapshot mitigates the problem of Wikidata’s dynamics – a reference answer may change with time as the knowledge base evolves. The sample guarantees the correctness of the queries and answers. In addition, the smaller dump makes it easier to conduct experiments with our dataset.
We strongly recommend using this sample for evaluation.
The sample is a collection of several RDF files in Turtle format:
- `wdt_all.ttl` contains all the truthy statements.
- `names.ttl` contains Russian and English labels and aliases for all entities. Names in other languages are also provided when needed.
- `onto.ttl` contains all Wikidata triples with the relation `wdt:P279` (subclass of). It represents a class hierarchy of sorts, but remember that there are no `class` or `instance` concepts in Wikidata.
- `pch_{0,6}.ttl` contain all statement nodes and their data for all entities.
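To illustrate how the `wdt:P279` triples can serve as a class hierarchy, here is a minimal sketch that collects all transitive superclasses of an entity. It assumes the triples have already been parsed into `(child, parent)` pairs of identifier strings by some Turtle parser; the function names and the placeholder identifiers in the usage example are hypothetical.

```python
from collections import defaultdict, deque


def build_parent_index(subclass_pairs):
    """Map each entity to the set of its direct superclasses."""
    parents = defaultdict(set)
    for child, parent in subclass_pairs:
        parents[child].add(parent)
    return parents


def all_superclasses(entity, parents):
    """Breadth-first walk over subclass-of edges.

    The `seen` set keeps the walk finite even if the data contains cycles.
    """
    seen = set()
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen and parent != entity:
                seen.add(parent)
                queue.append(parent)
    return seen


# Usage with placeholder identifiers (not real Wikidata items):
pairs = [("X1", "X2"), ("X2", "X3"), ("X4", "X3")]
index = build_parent_index(pairs)
superclasses_of_x1 = all_superclasses("X1", index)
```

An in-memory index like this is only practical for the comparatively small `onto.ttl`; whether it fits in memory on a given machine is not something the source states.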
Some questions in our dataset require using `rdfs:label` or `skos:altLabel` to retrieve an answer that is a literal. In cases where the answer language does not have to be inferred from the question, our evaluation script takes into account Russian literals only.
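The Russian-only rule can be pictured as a language filter applied before answers are compared. The sketch below is not the authors' evaluation script: it assumes literals are represented as `(value, language_tag)` pairs and that a prediction counts as correct on an exact set match, both of which are assumptions made for illustration.

```python
def keep_language(literals, lang="ru"):
    """Return the values of the literals whose language tag equals `lang`."""
    return {value for value, tag in literals if tag == lang}


def literal_answer_matches(predicted, reference, lang="ru"):
    """Compare two answer sets after dropping literals in other languages."""
    return keep_language(predicted, lang) == keep_language(reference, lang)
```

With this filter, a system that returns both a Russian and an English label for the same entity is judged on the Russian label alone.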
This work is licensed under a Creative Commons Attribution 4.0 International License.