Kyoto University Web Document Leads Corpus

Overview

This is a Japanese text corpus that consists of the lead three sentences of web documents, with various linguistic annotations. By collecting the lead three sentences of web documents, this corpus covers documents of various genres and styles, such as news articles, encyclopedic articles, blogs, and commercial pages. It comprises approximately 5,000 documents, corresponding to about 15,000 sentences.

The linguistic annotations cover morphology, named entities, dependencies, predicate-argument structures (including zero anaphora), coreferences, and discourse. All annotations except the discourse annotations were produced by manually correcting the automatic analyses of the morphological analyzer JUMAN and of KNP, an analyzer for dependencies, case structure, and anaphora. The discourse annotations were given by two types of annotators: experts and crowd workers.

Notes

This corpus consists of linguistically annotated Web documents that were publicly available on the Web at some point in time. The corpus is released to contribute to research on natural language processing.

Since the collected documents are fragmentary, i.e., only the lead three sentences of each Web document, we have not obtained permission from the copyright owners of the Web documents and do not provide source information such as URLs. If copyright owners of Web documents request the addition of source information or the deletion of their documents, we will update the corpus and release a new version. In that case, please delete the old version you downloaded and replace it with the new version.

Notes on annotation guidelines

The annotation guidelines for this corpus are written in the manuals found in the "doc" directory. The guidelines for morphology and dependencies are described in syn_guideline.pdf, those for predicate-argument structures and coreferences are described in rel_guideline.pdf, and those for discourse relations are described in disc_guideline.pdf. The guidelines for named entities are available on the IREX website (http://nlp.cs.nyu.edu/irex/).

Distributed files

knp/: the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences
disc/: the corpus annotated with discourse relations
org/: the raw corpus
doc/: annotation guidelines
id/: document ID files providing the train/dev/test split

Statistics

Split	# of documents	# of sentences	# of morphemes	# of named entities	# of predicates	# of coreferring mentions
train	3,915	11,745	194,490	6,267	51,702	16,079
dev	512	1,536	22,625	974	6,139	1,641
test	700	2,100	35,869	1,122	9,549	3,074
total	5,127	15,381	252,984	8,363	67,390	20,794
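As a quick sanity check on the table, the per-split counts can be summed and compared with the "total" row. The sketch below copies the numbers from the table; the names `STATS`, `REPORTED_TOTAL`, `column_sums`, and `sentences_per_document` are chosen here for illustration and are not part of the corpus tooling.

```python
# Per-split counts copied from the statistics table, in column order:
# documents, sentences, morphemes, named entities, predicates, coreferring mentions.
STATS = {
    "train": [3915, 11745, 194490, 6267, 51702, 16079],
    "dev": [512, 1536, 22625, 974, 6139, 1641],
    "test": [700, 2100, 35869, 1122, 9549, 3074],
}
REPORTED_TOTAL = [5127, 15381, 252984, 8363, 67390, 20794]


def column_sums(stats):
    """Sum each column over all splits."""
    return [sum(column) for column in zip(*stats.values())]


def sentences_per_document(stats):
    """Ratio of sentences to documents for each split."""
    return {split: row[1] / row[0] for split, row in stats.items()}


if __name__ == "__main__":
    print(column_sums(STATS) == REPORTED_TOTAL)
    print(sentences_per_document(STATS))
```

Every split has exactly three sentences per document, which matches the "lead three sentences" design of the corpus.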

Format of the corpus annotated with morphology, named entities, dependencies, predicate-argument structures, and coreferences

Annotations of this corpus are given in the following format.

# S-ID:w201106-0000010001-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS

The first line represents the ID of this sentence. In the subsequent lines, the lines starting with "*" denote "bunsetsu," the lines starting with "+" denote basic phrases, and the other lines denote morphemes.
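The line types described above can be told apart with a small classifier. This is a minimal sketch, not the project's own reader: the function name `classify_line` and the category labels are made up here, and the check that a "*" or "+" line is followed by a head number and a dependency letter is an assumption used to avoid confusing it with a morpheme whose surface form is itself "*" or "+". The sample is the example sentence with one entry per line.

```python
import re
from collections import Counter

SAMPLE = """\
# S-ID:w201106-0000010001-1
* 2D
+ 3D
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0
は は は 助詞 9 副助詞 2 * 0 * 0
* 2D
+ 2D
京都 きょうと 京都 名詞 6 地名 4 * 0 * 0
+ 3D <NE:ORGANIZATION:京都大学>
大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0
に に に 助詞 9 格助詞 1 * 0 * 0
* -1D
+ -1D <rel type="ガ" sid="w201106-0000010001-1"/><rel type="ニ" sid="w201106-0000010001-1"/>
行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10
EOS
"""

# A unit header is "*" or "+", a space, a (possibly negative) head ID, and a dependency letter.
BUNSETSU_RE = re.compile(r"^\* -?\d+[DPIA]")
BASIC_PHRASE_RE = re.compile(r"^\+ -?\d+[DPIA]")


def classify_line(line: str) -> str:
    """Return the kind of one line of the annotated corpus."""
    if line.startswith("# S-ID:"):
        return "sentence_id"
    if line == "EOS":
        return "eos"
    if BUNSETSU_RE.match(line):
        return "bunsetsu"
    if BASIC_PHRASE_RE.match(line):
        return "basic_phrase"
    return "morpheme"


if __name__ == "__main__":
    counts = Counter(classify_line(line) for line in SAMPLE.splitlines())
    print(dict(counts))
```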

A morpheme line is the same as the output of the morphological analyzers JUMAN and Juman++. It includes the surface string, reading, lemma, part of speech (POS), fine-grained POS, conjugation type, and conjugation form. "*" means that the field is not available. Note that this format is slightly different from KWDLC 1.0, which adopted the same format as Kyoto University Text Corpus 4.0.
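A morpheme line can be split into named fields as sketched below. The text names seven fields; in the example, each POS and conjugation field is followed by a number, which this sketch reads as a numeric ID for the preceding field. That reading, the `Morpheme` class, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Morpheme:
    surface: str
    reading: str
    lemma: str
    pos: str
    pos_id: int          # assumed: numeric ID of the POS
    subpos: str          # fine-grained POS, "*" if not available
    subpos_id: int
    conj_type: str       # conjugation type, "*" if not available
    conj_type_id: int
    conj_form: str       # conjugation form, "*" if not available
    conj_form_id: int


def parse_morpheme(line: str) -> Morpheme:
    """Parse one space-separated morpheme line; extra trailing fields are ignored."""
    f = line.split(" ")
    if len(f) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(f)}: {line!r}")
    return Morpheme(
        f[0], f[1], f[2],
        f[3], int(f[4]),
        f[5], int(f[6]),
        f[7], int(f[8]),
        f[9], int(f[10]),
    )


if __name__ == "__main__":
    m = parse_morpheme("行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10")
    print(m.surface, m.lemma, m.pos, m.conj_form)
```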

The line starting with "*" represents a "bunsetsu," the conventional unit of dependency in Japanese. A bunsetsu consists of one or more content words and zero or more function words. In this line, the first number is the ID of the head that the bunsetsu depends on. The letter that follows denotes the type of dependency relation: "D" (normal dependency), "P" (coordination dependency), "I" (incomplete coordination dependency), or "A" (appositive dependency).

The line starting with "+" represents a basic phrase, which is the unit to which various relations are annotated. A basic phrase consists of one content word and zero or more function words. It is therefore equivalent to a bunsetsu or a part of a bunsetsu. In this line, the first number is the ID of the head that the basic phrase depends on. The letter that follows is defined in the same way as for a bunsetsu. The remaining part of this line contains the annotations of named entities and various relations.
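The header lines for bunsetsu and basic phrases can be read into dependency edges as in the following sketch. It assumes that units are numbered from 0 in order of appearance within a sentence and that a head ID of -1 marks the root; both assumptions are inferred from the example sentence rather than stated in the text. The names `parse_header` and `dependency_edges` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

HEADER_RE = re.compile(r"^([*+]) (-?\d+)([DPIA])(?: (.*))?$")
DEP_TYPES = {
    "D": "normal",
    "P": "coordination",
    "I": "incomplete coordination",
    "A": "appositive",
}

SAMPLE_LINES = [
    "* 2D", "+ 3D",
    "太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0",
    "は は は 助詞 9 副助詞 2 * 0 * 0",
    "* 2D", "+ 2D",
    "京都 きょうと 京都 名詞 6 地名 4 * 0 * 0",
    "+ 3D <NE:ORGANIZATION:京都大学>",
    "大学 だいがく 大学 名詞 6 普通名詞 1 * 0 * 0",
    "に に に 助詞 9 格助詞 1 * 0 * 0",
    "* -1D", "+ -1D",
    "行った いった 行く 動詞 2 * 0 子音動詞カ行促音便形 3 タ形 10",
    "EOS",
]


class UnitHeader(NamedTuple):
    unit: str      # "bunsetsu" or "basic_phrase"
    head: int      # ID of the head unit (-1 assumed to mean root)
    dep_type: str  # one of D, P, I, A
    rest: str      # remaining annotations on the line, if any


def parse_header(line: str) -> Optional[UnitHeader]:
    """Parse a '*' or '+' line; return None for any other line."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    marker, head, dep, rest = m.groups()
    unit = "bunsetsu" if marker == "*" else "basic_phrase"
    return UnitHeader(unit, int(head), dep, rest or "")


def dependency_edges(lines, unit="bunsetsu"):
    """Return (index, head, dep_type) triples for one kind of unit."""
    headers = [h for h in map(parse_header, lines) if h and h.unit == unit]
    return [(i, h.head, h.dep_type) for i, h in enumerate(headers)]


if __name__ == "__main__":
    for i, head, dep in dependency_edges(SAMPLE_LINES, "basic_phrase"):
        print(i, "->", head, DEP_TYPES[dep])
```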

Named entity annotations are given in <NE> tags. <NE> has the following four attributes: type, target, possibility, and optional_type, which denote the class of a named entity, the string of a named entity, the possible classes for an OPTIONAL named entity, and a type for an OPTIONAL named entity, respectively. The details of these attributes are described in the IREX annotation guidelines.

Annotations of various relations are given in <rel> tags. <rel> has the following four attributes: type, target, sid, and id, which denote the name of a relation, the string of the counterpart, the sentence ID of the counterpart, and the basic phrase ID of the counterpart, respectively. If a basic phrase has multiple tags of the same type, a "mode" attribute is also assigned, which takes one of "AND," "OR," and "？." The details of these attributes are described in the annotation guidelines (rel_guideline.pdf).
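The tags on a basic phrase line can be pulled out with regular expressions, as in this sketch. It handles the two tag shapes that appear in the example sentence: <rel> tags with quoted attributes, and the compact <NE:CLASS:string> form. Other tag layouts in the corpus may need additional patterns; the function names are hypothetical.

```python
import re

REL_RE = re.compile(r"<rel\s+([^>]*?)\s*/>")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
NE_RE = re.compile(r"<NE:([^:>]+):([^>]+)>")


def parse_rels(text: str) -> list:
    """Return one attribute dict per <rel .../> tag, in order of appearance."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in REL_RE.finditer(text)]


def parse_named_entities(text: str) -> list:
    """Return (class, string) pairs for tags of the form <NE:CLASS:string>."""
    return [(m.group(1), m.group(2)) for m in NE_RE.finditer(text)]


if __name__ == "__main__":
    line = ('+ -1D <rel type="ガ" sid="w201106-0000010001-1"/>'
            '<rel type="ニ" sid="w201106-0000010001-1"/>')
    for rel in parse_rels(line):
        print(rel["type"], rel.get("target"), rel.get("sid"), rel.get("id"))
    print(parse_named_entities("+ 3D <NE:ORGANIZATION:京都大学>"))
```

Using `dict.get` for `target`, `id`, and `mode` keeps the reader tolerant of tags that omit an attribute.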

Format of the corpus annotated with discourse relations

In this corpus, a clause pair is given a discourse type and its votes as follows.

# A-ID:w201106-0001998536
1 今日とある企業のトップの話を聞くことが出来た。
2 経営者として何事も全てビジネスチャンスに変えるマインドが大切だと感じた。
3 生きていく上で追い風もあれば、
4 逆風もある。
1-2 談話関係なし:5  原因・理由:4  条件:1
3-4 原因・理由:3  談話関係なし:2  逆接:2  対比:2  目的:1

The first line represents the ID of this document, the subsequent block lists clause IDs and clauses, and the last block lists the discourse relations for clause pairs and their voting results. These discourse relations and voting results are the results of the second stage of crowdsourcing. Each line lists the discourse relations and their votes in descending order of votes. For the discourse relations annotated by experts, the discourse direction is also annotated; if the direction is reversed, "(逆方向)" is appended to the discourse relation. The details of the annotation methods and discourse relations are described in [Kawahara et al., 2014] and the annotation guidelines (disc_guideline.pdf).
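A document in this format can be read into clauses and vote lists as sketched below. The sample is reproduced with one entry per line. The sketch assumes that each vote item has the form "label:count" with an ASCII colon and that items are separated by whitespace; the function name `parse_discourse` and the return layout are chosen here for illustration.

```python
import re

SAMPLE = """\
# A-ID:w201106-0001998536
1 今日とある企業のトップの話を聞くことが出来た。
2 経営者として何事も全てビジネスチャンスに変えるマインドが大切だと感じた。
3 生きていく上で追い風もあれば、
4 逆風もある。
1-2 談話関係なし:5  原因・理由:4  条件:1
3-4 原因・理由:3  談話関係なし:2  逆接:2  対比:2  目的:1
"""

PAIR_RE = re.compile(r"^(\d+)-(\d+)\s+(.*)$")
CLAUSE_RE = re.compile(r"^(\d+)\s+(.*)$")


def parse_discourse(block: str):
    """Return (document ID, {clause ID: text}, {(id1, id2): [(label, votes), ...]})."""
    doc_id = None
    clauses = {}
    relations = {}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# A-ID:"):
            doc_id = line[len("# A-ID:"):].strip()
            continue
        pair = PAIR_RE.match(line)
        if pair:
            votes = []
            for item in pair.group(3).split():
                label, sep, count = item.rpartition(":")
                if sep and count.isdigit():  # skip anything that is not "label:count"
                    votes.append((label, int(count)))
            relations[(int(pair.group(1)), int(pair.group(2)))] = votes
            continue
        clause = CLAUSE_RE.match(line)
        if clause:
            clauses[int(clause.group(1))] = clause.group(2)
    return doc_id, clauses, relations


if __name__ == "__main__":
    doc_id, clauses, relations = parse_discourse(SAMPLE)
    print(doc_id, len(clauses))
    for pair, votes in relations.items():
        # Votes are listed in descending order, so the first item is the top-voted label.
        print(pair, votes[0], sum(count for _, count in votes))
```

In the sample, each clause pair received ten votes in total.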

References

Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi. Building a Diverse Document Leads Corpus Annotated with Semantic Relations, In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing, pp.535-544, 2012. http://www.aclweb.org/anthology/Y/Y12/Y12-1058.pdf
萩行正嗣, 河原大輔, 黒橋禎夫. 多様な文書の書き始めに対する意味関係タグ付きコーパスの構築とその分析, 自然言語処理, Vol.21, No.2, pp.213-248, 2014. https://doi.org/10.5715/jnlp.21.213
Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano. Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing, In Proceedings of the 25th International Conference on Computational Linguistics, pp.269-278, 2014. http://www.aclweb.org/anthology/C/C14/C14-1027.pdf
岸本裕大, 村脇有吾, 河原大輔, 黒橋禎夫. 日本語談話関係解析：タスク設計・談話標識の自動認識・コーパスアノテーション, 自然言語処理, Vol.27, No.4, pp.889-931, 2020. https://doi.org/10.5715/jnlp.27.889

Acknowledgment

The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core Technologies for Big Data Integration." The discourse annotations were acquired by crowdsourcing under the support of Yahoo! Japan Corporation. We deeply appreciate their support.

Contact

If you have any questions about or problems with this corpus, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp. Requests to add source information or to delete a document in the corpus should be sent to the same address.

Releases

Latest release: v1.1.1 (Dec 18, 2023)
