shigashiyama/en-ja-el
This dataset was constructed by translating the original English texts of existing English entity linking datasets (VoxEL, MEANTIME, and Linked-DocRED) into Japanese while preserving their annotation information. This includes mention spans for named entity recognition and knowledge base entry IDs for entity disambiguation. The texts were machine translated using the model from Min'na no Jidou Hon'yaku@TexTra and then fully post-edited by human translators. We added links to Wikidata entities and Wikipedia pages based on intra-KB links when the original datasets did not include them. For example, the original MEANTIME data contained only DBpedia links, so we added Wikidata and Wikipedia links.
The data statistics are as follows.
You can confirm them using the following command (requires Python 3.8.0 or later). For example:
python3 src/show_data_statistics.py -i data/voxel/json/en/s-voxel.json
| | VoxEL | MEANTIME | Linked-DocRED |
|---|---|---|---|
| Document | 15 | 120 | 500 |
| Sentence | 94 | 1,797 | 3,944 |
| Mention | 204 | 2,634 | 12,897 |
| Mention w/ Wikidata link | 201 | 1,861 | 8,568 |
| Mention w/ Wikipedia_En link | 201 | 1,867 | 8,624 |
| Mention w/ Wikipedia_Ja link | 187 | 1,781 | 5,768 |
| Mention w/ DBpedia link | | 1,871 | |
| Entity | 204 | 1,407 | 9,779 |
| Entity w/ Wikidata link | 201 | 785 | 6,023 |
| Entity w/ Wikipedia_En link | 201 | 789 | 6,076 |
| Entity w/ Wikipedia_Ja link | 187 | 747 | 4,170 |
| Entity w/ DBpedia link | | 791 | |
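As a rough illustration of how such counts could be derived, the following minimal sketch tallies documents, sentences, mentions, and entities from data in the JSON layout shown in this README. It is not the repository's `src/show_data_statistics.py`; the function name `count_statistics` is hypothetical, the second mention and entity in the sample are made up for illustration, and deriving mention-level link counts from the entity-level `has_wikidata_ref` flag is an assumption, so the numbers it yields may not match the table exactly.

```python
import json


def count_statistics(dataset):
    """Tally basic counts over a dict that maps document IDs to document objects."""
    stats = {
        "Document": 0,
        "Sentence": 0,
        "Mention": 0,
        "Entity": 0,
        "Mention w/ Wikidata link": 0,
        "Entity w/ Wikidata link": 0,
    }
    for doc in dataset.values():
        stats["Document"] += 1
        stats["Sentence"] += len(doc["sentences"])
        stats["Mention"] += len(doc["mentions"])
        stats["Entity"] += len(doc["entities"])
        for entity in doc["entities"].values():
            # Assumption: a mention counts as linked when its entity is linked.
            if entity.get("has_wikidata_ref"):
                stats["Entity w/ Wikidata link"] += 1
                stats["Mention w/ Wikidata link"] += len(entity["member_mention_ids"])
    return stats


# Inline sample in the README's format. M002/E002 are hypothetical additions.
sample = json.loads("""
{
  "001": {
    "doc_info": {"title": null, "url": null},
    "sentences": {
      "00": {"text": "EUの失業率は2008年以来の最低水準。", "mention_ids": ["M001", "M002"]}
    },
    "mentions": {
      "M001": {"sentence_id": "00", "span": [0, 2], "text": "EU",
               "entity_type": null, "entity_id": "E001"},
      "M002": {"sentence_id": "00", "span": [7, 12], "text": "2008年",
               "entity_type": null, "entity_id": "E002"}
    },
    "entities": {
      "E001": {"member_mention_ids": ["M001"], "entity_type": null,
               "has_wikidata_ref": true,
               "ref_urls": {"wikidata": "http://www.wikidata.org/entity/Q458"}},
      "E002": {"member_mention_ids": ["M002"], "entity_type": null,
               "has_wikidata_ref": false, "ref_urls": {}}
    }
  }
}
""")

stats = count_statistics(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

To run this on a real file, the inline sample could be replaced by `json.load` on an opened data file such as `data/voxel/json/en/s-voxel.json`, assuming the file's top level is the document-ID mapping described in this README.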
The number of mentions for each entity type is as follows. For Linked-DocRED, the names of living persons are masked with ■ symbols, and their spans are annotated with the `PER_MASKED` type.
| Type | VoxEL | MEANTIME | Linked-DocRED |
|---|---|---|---|
| No_Label | 204 | 2,634 | |
| PER | | | 1,260 |
| PER_MASKED | | | 1,088 |
| LOC | | | 4,122 |
| ORG | | | 1,838 |
| NUM | | | 669 |
| TIME | | | 1,996 |
| MISC | | | 1,924 |
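A per-type breakdown like the one above could be computed along the following lines. This is a minimal sketch, not the repository's script: the function name `count_mention_types` is hypothetical, the sample mentions are made up for illustration, and treating a `null` value of `entity_type` as the `No_Label` row is an assumption.

```python
from collections import Counter


def count_mention_types(dataset):
    """Count mentions per entity type across all documents."""
    counts = Counter()
    for doc in dataset.values():
        for mention in doc["mentions"].values():
            label = mention.get("entity_type")
            # Assumption: mentions without a type belong to the No_Label row.
            counts[label if label is not None else "No_Label"] += 1
    return counts


# Hypothetical documents; only the fields needed for counting are included.
sample = {
    "doc-a": {
        "mentions": {
            "M001": {"entity_type": None},
            "M002": {"entity_type": "LOC"},
        }
    },
    "doc-b": {
        "mentions": {
            "M001": {"entity_type": "LOC"},
            "M002": {"entity_type": "PER_MASKED"},
        }
    },
}

type_counts = count_mention_types(sample)
for label, count in sorted(type_counts.items()):
    print(f"{label}: {count}")
```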
- Each document object is associated with a key that represents the document ID (e.g., 001-1). Each document object contains `doc_info`, `sentences`, `mentions`, and `entities`.

      "001": {
        "doc_info": {
          "title": null,
          "url": "http://www.voxeurop.eu/en/2017/social-issues-5121271"
        },
        "sentences": { ... },
        "mentions": { ... },
        "entities": { ... }
      }

- A sentence object under `sentences` is as follows:

      "sentences": {
        "00": {
          "text": "EUの失業率は2008年以来の最低水準。",
          "mention_ids": [ "M001" ]
        },
        ...
      },

- A mention object under `mentions` is as follows:

      "mentions": {
        "M001": {
          "sentence_id": "00",
          "span": [ 0, 2 ],
          "text": "EU",
          "entity_type": null,
          "entity_id": "E001"
        },
        ...
      },

- An entity object under `entities`, which corresponds to a set of one or more mentions, is as follows:

      "entities": {
        "E001": {
          "member_mention_ids": [ "M001" ],
          "entity_type": null,
          "has_enwiki_ref": true,
          "has_jawiki_ref": true,
          "has_wikidata_ref": true,
          "has_dbpedia_ref": false,
          "ref_urls": {
            "en.wikipedia": "https://en.wikipedia.org/wiki/European_Union",
            "wikidata": "http://www.wikidata.org/entity/Q458",
            "ja.wikipedia": "https://ja.wikipedia.org/wiki/%E6%AC%A7%E5%B7%9E%E9%80%A3%E5%90%88"
          }
        },
        ...
      }

- The VoxEL data
  - We used `sVoxEL-en.ttl` from the VoxEL benchmark dataset.
  - Our extended data is licensed under the Academic Research Non-Commercial Limited CC-BY-NC-SA Reference-Type License.
- The MEANTIME data
  - We used 120 documents in `intra_cross-doc_annotation` from the NewsReader MEANTIME corpus (`meantime_newsreader_english_oct15.zip`).
  - Our extended data is licensed under the Academic Research Non-Commercial Limited CC-BY-NC-SA Reference-Type License.
- The Linked-DocRED data
  - We used `test_revised.json` from Linked-Re-DocRED.
  - Our extended data is licensed under the GPLv3 License.

        EnJaEL Linked-DocRED
        Copyright (C) 2025 National Institute of Information and Communications Technology (Shohei Higashiyama)

        This program is free software; you can redistribute it and/or modify
        it under the terms of the GNU General Public License as published by
        the Free Software Foundation; either version 3 of the License, or
        (at your option) any later version.

        This program is distributed in the hope that it will be useful,
        but WITHOUT ANY WARRANTY; without even the implied warranty of
        MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
        GNU General Public License for more details.

        You should have received a copy of the GNU General Public
        License along with this program. If not, see <http://www.gnu.org/licenses/>.

- 2025/01/29: Version 1.0 was released.
Please cite the following paper.
Japanese bibliography:

    @article{higashiyama-etal-2024-cadel,
        author  = "東山,翔平 and 出内,将夫 and 内山,将夫",
        title   = "日本語エンティティリンキングのための行政機関ウェブ文書コーパスの構築",
        journal = "情報処理学会研究報告",
        volume  = "2024-NL-260",
        number  = "10",
        pages   = "1--15",
        year    = "2024",
        month   = "jun",
        url     = "https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=235101&item_no=1",
    }

English bibliography:

    @article{higashiyama-etal-2024-cadel,
        author  = "Shohei Higashiyama and Masao Ideuchi and Masao Utiyama",
        title   = "Construction of the Administrative Agency Web Document Corpus for {Japanese} Entity Linking [in {Japanese}]",
        journal = "IPSJ SIG Technical Report",
        volume  = "2024-NL-260",
        number  = "10",
        pages   = "1--15",
        year    = "2024",
        month   = "jun",
        url     = "https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=235101&item_no=1",
    }

- Henry Rosales-Méndez, Aidan Hogan, and Barbara Poblete. VoxEL: A Benchmark Dataset for Multilingual Entity Linking. International Semantic Web Conference (ISWC), Monterey, United States, 2018. https://dl.acm.org/doi/10.1007/978-3-030-00668-6_11
- Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), European Language Resources Association (ELRA), Portorož, Slovenia, 2016. https://aclanthology.org/L16-1699/
- Qingyu Tan, Lu Xu, Lidong Bing, Hwee Tou Ng, and Sharifah Mahani Aljunied. Revisiting DocRED - Addressing the False Negative Problem in Relation Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8472-8487, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022. https://aclanthology.org/2022.emnlp-main.580