shigashiyama/en-ja-el

Overview

This dataset was constructed by translating the English texts of existing English entity linking datasets (VoxEL, MEANTIME, and Linked-DocRED) into Japanese while preserving their annotations: mention spans for named entity recognition and knowledge base entry IDs for entity disambiguation. The texts were machine translated using the model from Min'na no Jidou Hon'yaku@TexTra and then fully post-edited by human translators. Where an original dataset did not include them, we added links to Wikidata entities and Wikipedia pages based on intra-KB links. For example, the original MEANTIME data contained only DBpedia links, so we added Wikidata and Wikipedia links.

Data Statistics

The data statistics are as follows.

You can reproduce them with the following command (requires Python 3.8.0 or later), for example:

  • python3 src/show_data_statistics.py -i data/voxel/json/en/s-voxel.json

                                VoxEL   MEANTIME   Linked-DocRED
Document                           15        120             500
Sentence                           94      1,797           3,944
Mention                           204      2,634          12,897
Mention w/ Wikidata link          201      1,861           8,568
Mention w/ Wikipedia_En link      201      1,867           8,624
Mention w/ Wikipedia_Ja link      187      1,781           5,768
Mention w/ DBpedia link             -      1,871               -
Entity                            204      1,407           9,779
Entity w/ Wikidata link           201        785           6,023
Entity w/ Wikipedia_En link       201        789           6,076
Entity w/ Wikipedia_Ja link       187        747           4,170
Entity w/ DBpedia link              -        791               -

The number of mentions for each entity type is as follows. For Linked-DocRED, the names of living persons are masked with symbols, and their spans are annotated with the PER_MASKED type.

Type            VoxEL   MEANTIME   Linked-DocRED
No_Label          204      2,634               -
PER                 -          -           1,260
PER_MASKED          -          -           1,088
LOC                 -          -           4,122
ORG                 -          -           1,838
NUM                 -          -             669
TIME                -          -           1,996
MISC                -          -           1,924
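
As a rough illustration of how counts like those in the two tables above could be derived, the sketch below walks the JSON layout described under Data Format. It is not the repository's src/show_data_statistics.py; the function names load_docs and summarize are hypothetical, and two counting rules are assumptions: a mention is treated as linked when its entity has has_wikidata_ref set, and a null entity_type is mapped to the No_Label row. The actual script may count differently (for instance, it may merge entities across documents), so the output is not guaranteed to match the tables.

```python
import json
from collections import Counter


def load_docs(path):
    """Read one dataset file into a dict keyed by document ID."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(docs):
    """Return (overall counts, mention counts per entity type)."""
    totals = Counter()
    mentions_by_type = Counter()
    for doc in docs.values():
        totals["Document"] += 1
        totals["Sentence"] += len(doc["sentences"])
        totals["Mention"] += len(doc["mentions"])
        totals["Entity"] += len(doc["entities"])
        for entity in doc["entities"].values():
            if entity.get("has_wikidata_ref"):
                totals["Entity w/ Wikidata link"] += 1
                # Assumption: a mention counts as linked when its entity is linked.
                totals["Mention w/ Wikidata link"] += len(entity["member_mention_ids"])
        for mention in doc["mentions"].values():
            # Assumption: a null entity_type corresponds to the No_Label row.
            mentions_by_type[mention["entity_type"] or "No_Label"] += 1
    return totals, mentions_by_type


# Illustrative one-document input shaped like the documented format.
sample = {
    "001": {
        "doc_info": {"title": None, "url": None},
        "sentences": {
            "00": {"text": "EUの失業率は2008年以来の最低水準。", "mention_ids": ["M001"]}
        },
        "mentions": {
            "M001": {"sentence_id": "00", "span": [0, 2], "text": "EU",
                     "entity_type": None, "entity_id": "E001"}
        },
        "entities": {
            "E001": {"member_mention_ids": ["M001"], "entity_type": None,
                     "has_wikidata_ref": True}
        },
    }
}

totals, mentions_by_type = summarize(sample)
print(dict(totals))
print(dict(mentions_by_type))
```

To run it on a real file, pass the result of load_docs to summarize, using a path such as the one given in the command above.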

Data Format

JSON

  • Each document object is stored under a key that represents its document ID (e.g., 001-1). A document object has four members: doc_info, sentences, mentions, and entities.
    "001": {
      "doc_info": {
        "title": null,
        "url": "http://www.voxeurop.eu/en/2017/social-issues-5121271"
      },
      "sentences": {
        ...
      },
      "mentions": {
        ...
      },
      "entities": {
        ...
      }
    }
  • A sentence object under sentences is as follows:
    "sentences": {
      "00": {
        "text": "EUの失業率は2008年以来の最低水準。",
        "mention_ids": [
          "M001"
        ]
      },
      ...
    },
  • A mention object under mentions is as follows:
    "mentions": {
      "M001": {
        "sentence_id": "00",
        "span": [
          0,
          2
        ],
        "text": "EU",
        "entity_type": null,
        "entity_id": "E001"
      },
      ...
    },
  • An entity object under entities, which corresponds to a set of one or more mentions, is as follows:
    "entities": {
      "E001": {
        "member_mention_ids": [
          "M001"
        ],
        "entity_type": null,
        "has_enwiki_ref": true,
        "has_jawiki_ref": true,
        "has_wikidata_ref": true,
        "has_dbpedia_ref": false,
        "ref_urls": {
          "en.wikipedia": "https://en.wikipedia.org/wiki/European_Union",
          "wikidata": "http://www.wikidata.org/entity/Q458",
          "ja.wikipedia": "https://ja.wikipedia.org/wiki/%E6%AC%A7%E5%B7%9E%E9%80%A3%E5%90%88"
        }
      },
      ...
    }
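
The objects above reference each other by ID, so a typical first step is to join them. The following is a minimal sketch, not code from this repository: resolve_mentions is a hypothetical helper, and it assumes that span holds [start, end) character offsets into the sentence text, which is inferred only from the example (span [0, 2] with text "EU"). It looks up each mention's sentence and entity, checks that the span reproduces the mention text, and returns the reference URL for a chosen knowledge base.

```python
def resolve_mentions(doc, kb="wikidata"):
    """Return (mention_id, surface text, reference URL or None) for one document."""
    rows = []
    for mention_id, mention in doc["mentions"].items():
        sentence = doc["sentences"][mention["sentence_id"]]
        # Assumption: span is [start, end) in character offsets within the sentence.
        start, end = mention["span"]
        surface = sentence["text"][start:end]
        if surface != mention["text"]:
            raise ValueError(
                f"span mismatch for {mention_id}: {surface!r} != {mention['text']!r}"
            )
        # Fall back to an empty entity if the mention has no usable entity_id.
        entity = doc["entities"].get(mention.get("entity_id"), {})
        url = entity.get("ref_urls", {}).get(kb)
        rows.append((mention_id, surface, url))
    return rows


# The example document from this section, written as a Python dict.
doc = {
    "doc_info": {
        "title": None,
        "url": "http://www.voxeurop.eu/en/2017/social-issues-5121271",
    },
    "sentences": {
        "00": {"text": "EUの失業率は2008年以来の最低水準。", "mention_ids": ["M001"]}
    },
    "mentions": {
        "M001": {"sentence_id": "00", "span": [0, 2], "text": "EU",
                 "entity_type": None, "entity_id": "E001"}
    },
    "entities": {
        "E001": {
            "member_mention_ids": ["M001"],
            "entity_type": None,
            "has_wikidata_ref": True,
            "ref_urls": {
                "en.wikipedia": "https://en.wikipedia.org/wiki/European_Union",
                "wikidata": "http://www.wikidata.org/entity/Q458",
            },
        }
    },
}

rows = resolve_mentions(doc)
print(rows)
```

Passing kb="en.wikipedia" or kb="ja.wikipedia" selects the other keys shown in ref_urls; a missing key yields None rather than an error, which suits entities that lack a link in a given knowledge base.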

Data Sources and License

  • The VoxEL data
  • The MEANTIME data
  • The Linked-DocRED data
    • We used test_revised.json from Linked-Re-DocRED.
    • Our extended data is licensed under the GPLv3 License.
      EnJaEL Linked-DocRED
      Copyright (C) 2025 National Institute of Information and Communications Technology (Shohei Higashiyama)

      This program is free software; you can redistribute it and/or modify
      it under the terms of the GNU General Public License as published by
      the Free Software Foundation; either version 3 of the License, or
      (at your option) any later version.

      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.

      You should have received a copy of the GNU General Public
      License along with this program. If not, see <http://www.gnu.org/licenses/>.

Change Log

  • 2025/01/29: Version 1.0 was released.

Citation

Please cite the following paper.

Japanese bibliography:

@article{higashiyama-etal-2024-cadel,
    author  = "東山,翔平 and 出内,将夫 and 内山,将夫",
    title   = "日本語エンティティリンキングのための行政機関ウェブ文書コーパスの構築",
    journal = "情報処理学会研究報告",
    volume  = "2024-NL-260",
    number  = "10",
    pages   = "1--15",
    year    = "2024",
    month   = "jun",
    url     = "https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=235101&item_no=1",
}

English bibliography:

@article{higashiyama-etal-2024-cadel,
    author  = "Shohei Higashiyama and Masao Ideuchi and Masao Utiyama",
    title   = "Construction of the Administrative Agency Web Document Corpus for {Japanese} Entity Linking [in {Japanese}]",
    journal = "IPSJ SIG Technical Report",
    volume  = "2024-NL-260",
    number  = "10",
    pages   = "1--15",
    year    = "2024",
    month   = "jun",
    url     = "https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=235101&item_no=1",
}

Reference

  1. Henry Rosales-Méndez, Aidan Hogan, and Barbara Poblete. VoxEL: A Benchmark Dataset for Multilingual Entity Linking. In International Semantic Web Conference (ISWC), Monterey, United States, 2018. https://dl.acm.org/doi/10.1007/978-3-030-00668-6_11
  2. Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC 2016), European Language Resources Association (ELRA), Portorož, Slovenia, 2016. https://aclanthology.org/L16-1699/
  3. Qingyu Tan, Lu Xu, Lidong Bing, Hwee Tou Ng, and Sharifah Mahani Aljunied. Revisiting DocRED - Addressing the False Negative Problem in Relation Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8472–8487, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022. https://aclanthology.org/2022.emnlp-main.580
