ku-nlp/WikipediaAnnotatedCorpus

Overview

This is a Japanese text corpus that consists of Wikipedia articles with various linguistic annotations.

The annotations cover morphology, named entities, dependencies, predicate-argument structures (including zero anaphora), and coreference. For the annotation guidelines, see the manuals in the doc directory of the ku-nlp/KWDLC repository.

Distributed files

  • knp/: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
  • org/: the raw corpus
  • id/: document id files providing train/dev/test split
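A minimal sketch of turning a split into a list of annotated files. It assumes that each file under id/ lists one document id per line (the file name `train.id` in the usage note is likewise an assumption), and that a document lives at knp/<first 8 characters of the id>/<id>.knp, as suggested by the path knp/wiki0010/wiki00100176.knp used in this README. The helper names `read_doc_ids` and `knp_path` are hypothetical.

```python
from pathlib import Path


def read_doc_ids(id_file: Path) -> list[str]:
    """Read document ids, assuming one id per line (blank lines are skipped)."""
    text = id_file.read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def knp_path(root: Path, doc_id: str) -> Path:
    """Map a document id to its .knp file.

    Assumes the subdirectory name is the first 8 characters of the id,
    as in knp/wiki0010/wiki00100176.knp.
    """
    return root / "knp" / doc_id[:8] / f"{doc_id}.knp"
```

For example, `[knp_path(Path("."), i) for i in read_doc_ids(Path("id/train.id"))]` would list the training files if those assumptions hold.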

Statistics

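| Split | Documents | Sentences | Morphemes | Named entities | Predicates | Coreferring mentions |
|-------|----------:|----------:|----------:|---------------:|-----------:|---------------------:|
| train | 2,396 | 5,618 | 137,171 | 9,490 | 37,171 | 30,977 |
| dev   | 100   | 248   | 6,353   | 423   | 1,702  | 1,435  |
| test  | 200   | 455   | 11,123  | 801   | 2,875  | 2,533  |
| total | 2,696 | 6,321 | 154,647 | 10,714 | 41,748 | 34,945 |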
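As a quick sanity check on these statistics, the per-split counts can be summed column by column and compared with the total row. The numbers below are copied from the statistics; the dictionary layout is only an illustrative choice.

```python
# Counts copied from the Statistics section, in the order:
# documents, sentences, morphemes, named entities, predicates, coreferring mentions
splits = {
    "train": (2396, 5618, 137171, 9490, 37171, 30977),
    "dev": (100, 248, 6353, 423, 1702, 1435),
    "test": (200, 455, 11123, 801, 2875, 2533),
}
total = (2696, 6321, 154647, 10714, 41748, 34945)

# Column-wise sums over the three splits
summed = tuple(sum(column) for column in zip(*splits.values()))
print(summed == total)  # True: the total row is consistent with the splits

# Average number of sentences per document over the whole corpus
sentences_per_doc = round(total[1] / total[0], 2)
print(sentences_per_doc)  # 2.34
```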

Format of the annotation

Annotations of this corpus are given in the following format (a.k.a. the KNP format).

    # S-ID:wiki000010000-1
    * 2D
    + 3D
    太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
    は は は 助詞 9 副助詞 2 * 0 * 0
    * 2D
    + 2D
    京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
    + 3D <NE:ORGANIZATION:京都大学>
    大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
    に に に 助詞 9 格助詞 1 * 0 * 0
    * -1D
    + -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
    行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
    EOS

A description of this format can be found in the documentation of KWDLC.
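To make the structure of the sample concrete, here is a minimal, standard-library-only sketch of a reader for one sentence. It assumes the usual reading of the format: a `#` line carries the sentence id, a `*` line opens a phrase (bunsetsu), a `+` line opens a base phrase with its dependency target and type followed by feature tags, every other line is a morpheme whose first field is the surface form, and `EOS` ends the sentence. The names `BasePhrase` and `parse_sentence` are hypothetical, and the sketch is not a substitute for the KWDLC documentation.

```python
import re
from dataclasses import dataclass, field

# The sample sentence from this README, one record per line.
SAMPLE = """\
# S-ID:wiki000010000-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
"""

# "* 2D" / "+ -1D <features>": marker, head index, dependency type, features.
# Requiring a numeric head keeps symbol morphemes such as "+ + + ..." out.
DEP_RE = re.compile(r"^([*+]) (-?\d+)([A-Z])(?: (.*))?$")
REL_RE = re.compile(r'<rel type="([^"]+)"')
NE_RE = re.compile(r"<NE:([^:>]+):([^>]+)>")


@dataclass
class BasePhrase:
    head: int       # index of the base phrase this one depends on (-1 = root)
    dep_type: str   # dependency type letter, e.g. "D"
    features: str   # raw feature string following the dependency field
    morphemes: list[str] = field(default_factory=list)  # surface forms


def parse_sentence(knp: str) -> list[BasePhrase]:
    """Collect base phrases and their morpheme surfaces from one sentence."""
    phrases: list[BasePhrase] = []
    for line in knp.splitlines():
        if not line or line.startswith("#") or line == "EOS":
            continue
        match = DEP_RE.match(line)
        if match:
            if match.group(1) == "+":
                phrases.append(
                    BasePhrase(int(match.group(2)), match.group(3), match.group(4) or "")
                )
            continue  # "*" lines (bunsetsu boundaries) are ignored in this sketch
        # Morpheme line: the first space-separated field is the surface form.
        phrases[-1].morphemes.append(line.split(" ")[0])
    return phrases


phrases = parse_sentence(SAMPLE)
for index, phrase in enumerate(phrases):
    print(index, "".join(phrase.morphemes), "->", phrase.head,
          REL_RE.findall(phrase.features), NE_RE.findall(phrase.features))
```

On the sample this yields four base phrases; the last one (行った) is the root and carries the ガ and ニ relations, while the third carries the ORGANIZATION named entity 京都大学.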

Note: You can use rhoknp to intuitively access annotations from Python without understanding the syntax of this format.

    from rhoknp import Document

    with open("knp/wiki0010/wiki00100176.knp") as f:
        document = Document.from_knp(f.read())

    for morpheme in document.morphemes:
        ...

References

  • 萩行正嗣, 河原大輔, 黒橋禎夫. 多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, 自然言語処理, Vol.21, No.2, pp.213-248, 2014. https://doi.org/10.5715/jnlp.21.213

Author

Language Media Processing Lab, Kyoto University (contact at nlp.ist.i.kyoto-u.ac.jp)

  • Nobuhiro Ueda <ueda at nlp.ist.i.kyoto-u.ac.jp>

Contact

If you have any questions about or problems with this corpus, please email <nl-resource at nlp.ist.i.kyoto-u.ac.jp>.

License

This corpus is licensed under CC BY-SA 4.0. https://creativecommons.org/licenses/by-sa/4.0/
