ku-nlp/KWDLC
This is a Japanese text corpus that consists of the lead three sentences of web documents with various linguistic annotations. By collecting the lead three sentences of web documents, this corpus contains documents with various genres and styles, such as news articles, encyclopedic articles, blogs and commercial pages. It comprises approximately 5,000 documents, which correspond to 15,000 sentences.
The linguistic annotations consist of annotations of morphology, named entities, dependencies, predicate-argument structures including zero anaphora, coreferences, and discourse. All the annotations except the discourse annotations were produced by manually correcting the automatic analyses of the morphological analyzer JUMAN and the dependency, case structure and anaphora analyzer KNP. The discourse annotations were given by two types of annotators: experts and crowd workers.
This corpus consists of linguistically annotated Web documents that have been made publicly available on the Web at some time. The corpus is released for the purpose of contributing to research in natural language processing.
Since the collected documents are fragmentary, i.e., only the lead three sentences of each Web document, we have not obtained permission from the copyright owners of the Web documents and do not provide source information such as URLs. If copyright owners of Web documents request the addition of source information or the deletion of these documents, we will update the corpus and release a new version. In that case, please delete the old version you downloaded and replace it with the new version.
The annotation guidelines for this corpus are written in the manuals found in the "doc" directory. The guidelines for morphology and dependencies are described in syn_guideline.pdf, those for predicate-argument structures and coreferences are described in rel_guideline.pdf, and those for discourse relations are described in disc_guideline.pdf. The guidelines for named entities are available on the IREX website (http://nlp.cs.nyu.edu/irex/).
- knp/: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
- disc/: the corpus annotated with discourse relations
- org/: the raw corpus
- doc/: annotation guidelines
- id/: document id files providing train/test split
Split | # of documents | # of sentences | # of morphemes | # of named entities | # of predicates | # of coreferring mentions |
---|---|---|---|---|---|---|
train | 3,915 | 11,745 | 194,490 | 6,267 | 51,702 | 16,079 |
dev | 512 | 1,536 | 22,625 | 974 | 6,139 | 1,641 |
test | 700 | 2,100 | 35,869 | 1,122 | 9,549 | 3,074 |
total | 5,127 | 15,381 | 252,984 | 8,363 | 67,390 | 20,794 |
Format of the corpus annotated with annotations of morphology, named entities, dependencies, predicate-argument structures, and coreferences
Annotations of this corpus are given in the following format.

    # S-ID:w201106-0000010001-1
    * 2D
    + 3D
    太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
    は は は 助詞 9 副助詞 2 * 0 * 0
    * 2D
    + 2D
    京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
    + 3D <NE:ORGANIZATION:京都大学>
    大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
    に に に 助詞 9 格助詞 1 * 0 * 0
    * -1D
    + -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
    行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
    EOS

The first line represents the ID of this sentence. In the subsequent lines, the lines starting with "*" denote "bunsetsu," the lines starting with "+" denote basic phrases, and the other lines denote morphemes.
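The line types described above could be told apart with a small classifier. This is a minimal sketch, not part of the corpus tooling: the function name `classify_line` is hypothetical, and the regular expression assumes that every bunsetsu or basic phrase line starts with the marker, a space, a head ID, and one of the letters D, P, I, A. That extra check is there so that a morpheme whose surface string happens to be "*" or "+" is not mistaken for a header.

```python
import re

# Marker ("*" or "+"), a space, a head ID such as 2 or -1, and a dependency letter.
HEADER = re.compile(r"^[*+] -?\d+[DPIA]( |$)")


def classify_line(line: str) -> str:
    """Return the kind of one annotation line (hypothetical helper)."""
    if line.startswith("# S-ID:"):
        return "sentence_id"
    if line == "EOS":
        return "eos"
    if HEADER.match(line):
        return "bunsetsu" if line[0] == "*" else "basic_phrase"
    # Everything else is treated as a morpheme line.
    return "morpheme"


for sample in ["# S-ID:w201106-0000010001-1", "* 2D", "+ 3D", "EOS"]:
    print(sample, "->", classify_line(sample))
```

Grouping the morpheme lines under the most recent "+" line, and the "+" lines under the most recent "*" line, would then recover the nesting of a sentence.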
A morpheme line has the same format as the output of the morphological analyzers JUMAN and Juman++. It includes the surface string, reading, lemma, part of speech (POS), fine-grained POS, conjugation type, and conjugation form. "*" means that the field is not available. Note that this format is slightly different from KWDLC 1.0, which adopted the same format as Kyoto University Text Corpus 4.0.
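As an illustration of the morpheme fields, the sketch below splits one line into a named tuple. The names `Morpheme` and `parse_morpheme` are hypothetical. It assumes the fields are separated by single spaces and that each of the four categorical fields is followed by a number, as in the sample above; those numbers are not described in this document and are simply skipped here.

```python
from typing import NamedTuple, Optional


class Morpheme(NamedTuple):
    surface: str
    reading: str
    lemma: str
    pos: str
    subpos: Optional[str]     # fine-grained POS
    conj_type: Optional[str]  # conjugation type
    conj_form: Optional[str]  # conjugation form


def _value(field: str) -> Optional[str]:
    # "*" marks a field that is not available.
    return None if field == "*" else field


def parse_morpheme(line: str) -> Morpheme:
    """Parse one morpheme line (assumed layout: 11 space-separated fields)."""
    f = line.split(" ")
    if len(f) < 11:
        raise ValueError(f"unexpected morpheme line: {line!r}")
    # f[4], f[6], f[8], f[10] are the numbers that follow each label.
    return Morpheme(f[0], f[1], f[2], f[3], _value(f[5]), _value(f[7]), _value(f[9]))


m = parse_morpheme("行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10")
print(m.lemma, m.pos, m.conj_form)
```

A surface string that itself contains a space would break this simple split, so real data may need a more careful tokenizer.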
The line starting with "*" represents a "bunsetsu," which is a conventional unit for dependency in Japanese. A bunsetsu consists of one or more content words and zero or more function words. In this line, the first number is the ID of the head that the bunsetsu depends on. The letter that follows denotes the type of dependency relation: "D" (normal dependency), "P" (coordination dependency), "I" (incomplete coordination dependency), or "A" (appositive dependency).
The line starting with "+" represents a basic phrase, which is the unit to which various relations are annotated. A basic phrase consists of one content word and zero or more function words. Therefore, it is equivalent to a bunsetsu or a part of a bunsetsu. In this line, the first number is the ID of the head that the basic phrase depends on. The letter that follows is defined in the same way as for a bunsetsu. The remaining part of this line contains the annotations of named entities and various relations.
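Because bunsetsu and basic phrase lines share the same head-and-type prefix, a single helper can read both. The following is a sketch under assumptions: the names `Header` and `parse_header` are hypothetical, and the pattern is inferred from the sample sentence rather than from a formal specification. In the sample the last unit has head -1, which appears to mark a unit with no head; that reading is an assumption as well.

```python
import re
from typing import NamedTuple

HEADER = re.compile(r"^([*+]) (-?\d+)([DPIA])(?: (.*))?$")

DEP_TYPES = {
    "D": "normal",
    "P": "coordination",
    "I": "incomplete coordination",
    "A": "appositive",
}


class Header(NamedTuple):
    unit: str      # "bunsetsu" or "basic_phrase"
    head: int      # ID of the head this unit depends on
    dep_type: str  # spelled-out dependency type
    rest: str      # remaining text, e.g. <NE> and <rel> tags


def parse_header(line: str) -> Header:
    """Parse a line starting with "*" or "+" (hypothetical helper)."""
    m = HEADER.match(line)
    if m is None:
        raise ValueError(f"not a bunsetsu or basic phrase line: {line!r}")
    marker, head, dep, rest = m.groups()
    unit = "bunsetsu" if marker == "*" else "basic_phrase"
    return Header(unit, int(head), DEP_TYPES[dep], rest or "")


print(parse_header("* 2D"))
print(parse_header("+ 3D <NE:ORGANIZATION:京都大学>"))
```

Collecting `head` for the n-th "*" line of a sentence gives the bunsetsu dependency edges; doing the same for "+" lines gives the basic phrase edges.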
Annotations of named entities are given in `<NE>` tags. `<NE>` has the following four attributes: type, target, possibility, and optional_type, which mean the class of a named entity, the string of a named entity, possible classes for an OPTIONAL named entity, and a type for an OPTIONAL named entity, respectively. The details of these attributes are described in the IREX annotation guidelines.
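For the compact tag form that appears in the sample sentence, `<NE:ORGANIZATION:京都大学>`, the type and target can be pulled out with a regular expression. This sketch covers only that form; how the possibility and optional_type attributes are serialized is not shown in this document, so they are not handled. The function name `extract_named_entities` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches <NE:TYPE:target>, the form shown in the sample sentence.
NE_TAG = re.compile(r"<NE:([^:>]+):([^>]*)>")


def extract_named_entities(phrase_line: str) -> List[Tuple[str, str]]:
    """Return (type, target) pairs found on a basic phrase line."""
    return NE_TAG.findall(phrase_line)


print(extract_named_entities("+ 3D <NE:ORGANIZATION:京都大学>"))
```

The tag sits on the last basic phrase of the entity (here 大学), while the target string spans the whole entity.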
Annotations of various relations are given in `<rel>` tags. `<rel>` has the following four attributes: type, target, sid, and id, which mean the name of a relation, the string of the counterpart, the sentence ID of the counterpart, and the basic phrase ID of the counterpart, respectively. If a basic phrase has multiple tags of the same type, a "mode" attribute is also assigned, whose value is one of "AND", "OR", and "?". The details of these attributes are described in the annotation guidelines (rel_guideline.pdf).
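The `<rel>` tags on a basic phrase line can be read into dictionaries, one per tag. This is a minimal sketch with a hypothetical function name, `extract_relations`; it assumes attributes are written as `name="value"` with double quotes, as in the sample sentence, and it keeps whatever attributes are present rather than requiring all of them.

```python
import re
from typing import Dict, List

REL_TAG = re.compile(r"<rel\s+([^>]*?)/?>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def extract_relations(phrase_line: str) -> List[Dict[str, str]]:
    """Return one attribute dictionary per <rel> tag on the line."""
    return [dict(ATTRIBUTE.findall(body)) for body in REL_TAG.findall(phrase_line)]


line = ('+ -1D <rel type="ガ" sid="w201106-0000010001-1"/>'
        '<rel type="ニ" sid="w201106-0000010001-1"/>')
for rel in extract_relations(line):
    # .get() is used because target, id, and mode may be absent.
    print(rel["type"], rel.get("sid"), rel.get("target"), rel.get("id"))
```

Returning plain dictionaries keeps the optional "mode" attribute available without special handling.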
In this corpus, a clause pair is given a discourse type and its votes as follows.

    # A-ID:w201106-0001998536
    1 今日とある企業のトップの話を聞くことが出来た。
    2 経営者として何事も全てビジネスチャンスに変えるマインドが大切だと感じた。
    3 生きていく上で追い風もあれば、
    4 逆風もある。
    1-2 談話関係なし:5 原因・理由:4 条件:1
    3-4 原因・理由:3 談話関係なし:2 逆接:2 対比:2 目的:1

The first line represents the ID of this document, the subsequent block lists clause IDs and clauses, and the last block lists the discourse relations for clause pairs and their voting results. These discourse relations and voting results are the results of the second stage of crowdsourcing. Each line lists the discourse relations and their votes in descending order of votes. For discourse relations annotated by experts, the discourse direction is also annotated; if the direction is reversed, "(逆方向)" is appended to the discourse relation. The details of the annotation methods and discourse relations are described in [Kawahara et al., 2014] and the annotation guidelines (disc_guideline.pdf).
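A document in this format can be read into a document ID, a clause table, and per-pair vote lists. The sketch below is an illustration only: `parse_discourse` is a hypothetical name, the layout (one clause or clause pair per line, whitespace-separated `label:votes` items) is inferred from the sample above, and items without a numeric vote count are kept with `None` because the exact layout of expert annotations is not shown here.

```python
import re
from typing import Dict, List, Optional, Tuple

PAIR = re.compile(r"^(\d+)-(\d+)\s+(.+)$")
CLAUSE = re.compile(r"^(\d+)\s+(.*)$")
Votes = List[Tuple[str, Optional[int]]]


def parse_discourse(block: str):
    """Parse one discourse-annotated document (assumed layout, see lead-in)."""
    doc_id: Optional[str] = None
    clauses: Dict[int, str] = {}
    relations: Dict[Tuple[int, int], Votes] = {}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# A-ID:"):
            doc_id = line[len("# A-ID:"):]
            continue
        pair = PAIR.match(line)
        if pair:
            votes: Votes = []
            for item in pair.group(3).split():
                label, sep, count = item.rpartition(":")
                if sep and count.isdigit():
                    votes.append((label, int(count)))
                else:
                    votes.append((item, None))  # no vote count found
            relations[(int(pair.group(1)), int(pair.group(2)))] = votes
            continue
        clause = CLAUSE.match(line)
        if clause:
            clauses[int(clause.group(1))] = clause.group(2)
    return doc_id, clauses, relations


sample = """# A-ID:w201106-0001998536
3 生きていく上で追い風もあれば、
4 逆風もある。
3-4 原因・理由:3 談話関係なし:2 逆接:2 対比:2 目的:1
"""
doc_id, clauses, relations = parse_discourse(sample)
print(doc_id, clauses[4], relations[(3, 4)][0])
```

Since each line is ordered by votes, the first entry of a vote list is the most frequently chosen relation for that clause pair.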
- Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi. Building a Diverse Document Leads Corpus Annotated with Semantic Relations, In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing, pp.535-544, 2012. http://www.aclweb.org/anthology/Y/Y12/Y12-1058.pdf
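- Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi. Building and Analyzing a Diverse Document Leads Corpus Annotated with Semantic Relations (多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, in Japanese), Journal of Natural Language Processing (自然言語処理), Vol.21, No.2, pp.213-248, 2014. https://doi.org/10.5715/jnlp.21.213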
- Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano. Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing, In Proceedings of the 25th International Conference on Computational Linguistics, pp.269-278, 2014. http://www.aclweb.org/anthology/C/C14/C14-1027.pdf
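- Yudai Kishimoto, Yugo Murawaki, Daisuke Kawahara and Sadao Kurohashi. Japanese Discourse Relation Analysis: Task Definition, Connective Detection, and Corpus Annotation (日本語談話関係解析:タスク設計・談話標識の自動認識・コーパスアノテーション, in Japanese), Journal of Natural Language Processing (自然言語処理), Vol.27, No.4, pp.889-931, 2020. https://doi.org/10.5715/jnlp.27.889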
The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core Technologies for Big Data Integration." The discourse annotations were acquired by crowdsourcing under the support of Yahoo! Japan Corporation. We deeply appreciate their support.
If you have any questions or problems with this corpus, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp. If you have a request to add source information or to delete a document in the corpus, please send an email to this address.
Kyoto University Web Document Leads Corpus