ku-nlp/WikipediaAnnotatedCorpus
This is a Japanese text corpus that consists of Wikipedia articles with various linguistic annotations.
The linguistic annotations cover morphology, named entities, dependencies, predicate-argument structures including zero anaphora, and coreferences. For the annotation guidelines, see the manuals in the `doc` directory of the ku-nlp/KWDLC repository.
- `knp/`: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
- `org/`: the raw corpus
- `id/`: document id files providing the train/dev/test split
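As a rough sketch of how the `id/` files might be used to select the documents of one split: the code below assumes each id file lists one document id per line, and that a document is stored at `knp/<first 8 characters of the id>/<id>.knp` (inferred from the single path `knp/wiki0010/wiki00100176.knp`). Both points are assumptions, as are the helper names `parse_ids` and `knp_path`.

```python
from pathlib import Path


def parse_ids(text: str) -> list[str]:
    """Return the document ids in `text`, assuming one id per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def knp_path(corpus_root: Path, doc_id: str) -> Path:
    """Build the path of a .knp file for a document id.

    Assumed layout: knp/<first 8 characters of the id>/<id>.knp
    """
    return corpus_root / "knp" / doc_id[:8] / f"{doc_id}.knp"


def split_paths(corpus_root: Path, id_file: Path) -> list[Path]:
    """Map every id listed in `id_file` to its assumed .knp path."""
    ids = parse_ids(id_file.read_text(encoding="utf-8"))
    return [knp_path(corpus_root, doc_id) for doc_id in ids]
```

For example, `split_paths(Path("."), Path("id/train.id"))` would list the training documents, where the file name `train.id` is also hypothetical and should be checked against the actual contents of `id/`.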
Split | Documents | Sentences | Morphemes | Named entities | Predicates | Coreferring mentions |
---|---|---|---|---|---|---|
train | 2,396 | 5,618 | 137,171 | 9,490 | 37,171 | 30,977 |
dev | 100 | 248 | 6,353 | 423 | 1,702 | 1,435 |
test | 200 | 455 | 11,123 | 801 | 2,875 | 2,533 |
total | 2,696 | 6,321 | 154,647 | 10,714 | 41,748 | 34,945 |
Annotations of this corpus are given in the following format (a.k.a. the KNP format).
```
# S-ID:wiki000010000-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
```
A description of this format can be found in the documentation of KWDLC.
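To make the line types in the sample concrete, here is a minimal hand-rolled reader for a single sentence. It is only an illustrative sketch and not the official parser. It assumes that lines starting with `*` open a phrase (bunsetsu), lines starting with `+` open a base phrase whose number and letter give the index of its head and the dependency type, all other lines before `EOS` are space-separated morpheme fields, and named entities appear as `<NE:CATEGORY:text>` features. The function names are hypothetical.

```python
import re

SAMPLE = """\
# S-ID:wiki000010000-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
"""

# "* 2D" style lines (phrase) and "+ 3D <features>" style lines (base phrase).
BUNSETSU = re.compile(r"^\* (-?\d+)([A-Z])")
BASE_PHRASE = re.compile(r"^\+ (-?\d+)([A-Z])(?: (.*))?$")
NAMED_ENTITY = re.compile(r"<NE:([A-Z_]+):([^>]+)>")


def parse_sentence(text):
    """Parse one sentence into its id and a list of base phrases."""
    sentence = {"sid": None, "base_phrases": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("# S-ID:"):
            sentence["sid"] = line[len("# S-ID:"):].split()[0]
        elif line == "EOS":
            break
        elif BUNSETSU.match(line):
            continue  # phrase boundaries are ignored in this sketch
        elif (m := BASE_PHRASE.match(line)):
            sentence["base_phrases"].append({
                "head": int(m.group(1)),      # index of the head base phrase, -1 for the root
                "dep_type": m.group(2),
                "features": m.group(3) or "",
                "morphemes": [],
            })
        else:
            surface, reading, lemma, pos = line.split(" ")[:4]
            sentence["base_phrases"][-1]["morphemes"].append(
                {"surface": surface, "reading": reading, "lemma": lemma, "pos": pos}
            )
    return sentence


def named_entities(sentence):
    """Collect (category, text) pairs from the base phrase features."""
    return [
        pair
        for bp in sentence["base_phrases"]
        for pair in NAMED_ENTITY.findall(bp["features"])
    ]


if __name__ == "__main__":
    parsed = parse_sentence(SAMPLE)
    for bp in parsed["base_phrases"]:
        print("".join(m["surface"] for m in bp["morphemes"]), "->", bp["head"])
    print(named_entities(parsed))
```

The sketch skips everything else the format carries (for example the `<rel>` tags for predicate-argument structures and coreference), so for real work a dedicated library is the safer choice.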
Note: You can use rhoknp to access the annotations intuitively from Python without understanding the syntax of this format.
```python
from rhoknp import Document

# Load one annotated document from the knp/ directory.
with open("knp/wiki0010/wiki00100176.knp") as f:
    document = Document.from_knp(f.read())

for morpheme in document.morphemes:
    ...
```
- 萩行正嗣, 河原大輔, 黒橋禎夫. 多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, 自然言語処理, Vol.21, No.2, pp.213-248, 2014. https://doi.org/10.5715/jnlp.21.213
Language Media Processing Lab, Kyoto University (京都大学 言語メディア研究室) (contact at nlp.ist.i.kyoto-u.ac.jp)
- Nobuhiro Ueda <ueda at nlp.ist.i.kyoto-u.ac.jp>
If you have any questions about or problems with this corpus, please email <nl-resource at nlp.ist.i.kyoto-u.ac.jp>.
This corpus is licensed under CC BY-SA 4.0. https://creativecommons.org/licenses/by-sa/4.0/