megagonlabs/UD_Japanese-GSD

Forked from UniversalDependencies/UD_Japanese-GSD

Japanese data from the Google UDT 2.0.

Folders and files

- spacy
- stanza
- .gitignore
- CITATION
- CONTRIBUTING.md
- LICENSE.txt
- README.md
- ene_mapping.xlsx
- eval.log
- ja_gsd-ud-dev.conllu
- ja_gsd-ud-test.conllu
- ja_gsd-ud-train.conllu
- leader_board.md
- stats.xml

Summary

This Universal Dependencies (UD) Japanese treebank follows the UD Japanese conventions described in the UD documentation. The original sentences are from Google UDT 2.0.

In addition, Megagon Labs Tokyo added files named *.ne.conllu, which contain BILUO-style named entity gold labels in the MISC field. The files below are converted from *.ne.conllu for use with NLP frameworks:

- ja_gsd-ud-(train|dev|test).ne.json: https://spacy.io/api/cli#train
- (train|dev|test).ne.bio: https://github.com/stanfordnlp/stanza#training-your-own-neural-pipelines
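
As a rough illustration of how BILUO labels relate to the BIO scheme suggested by the .ne.bio file names, the sketch below maps one tag sequence to the other. It is not the converter used in this repository; the function name `biluo_to_bio` and the entity labels in the usage example are assumptions made for illustration.

```python
def biluo_to_bio(tags):
    """Map BILUO tags to BIO tags: L- becomes I-, U- becomes B-."""
    bio_tags = []
    for tag in tags:
        if "-" not in tag:
            # "O" (outside) and any unprefixed tag pass through unchanged.
            bio_tags.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "L":    # last token of a multi-token entity
            prefix = "I"
        elif prefix == "U":  # single-token entity
            prefix = "B"
        bio_tags.append(f"{prefix}-{label}")
    return bio_tags


# Hypothetical label names, not taken from the treebank.
example = ["B-PERSON", "I-PERSON", "L-PERSON", "O", "U-GPE"]
print(biluo_to_bio(example))
```

The mapping only relabels prefixes, so entity spans are preserved; going back from BIO to BILUO would additionally require looking at the following tag.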

Introduction

The Japanese UD treebank contains sentences from Google Universal Dependency Treebanks v2.0 (legacy): https://github.com/ryanmcd/uni-dep-tb. First, Google UDT v2.0 was converted to UD style with bunsetsu-based word units (referred to here as the "master" corpus).

The word units in "master" differ significantly from the definition in the UD documentation, which is based on Short Unit Words (SUW) [1], so the sentences were automatically re-processed by Hiroshi Kanayama in February 2017. The result is UD_Japanese v2.0, which was used in the CoNLL 2017 shared task. In November 2017, UD_Japanese v2.0 was merged with the "master" data so that the manual dependency annotations could be reflected in the corpus. This reduced the errors in dependency structures and relation labels.

There are still slight differences in word units between UD_Japanese v2.1 and UD_Japanese-KTC 1.3.

In May 2020, we introduced a conversion method like that of UD_Japanese BCCWJ [3] for UD_Japanese GSD v2.6.

Specification

Overview

The data is manually tokenized in three layers: Short Unit Word (SUW) [4], Long Unit Word (LUW) [5], and base phrase (bunsetsu) [5], as in the `Balanced Corpus of Contemporary Written Japanese' [6]. The original morphological labels are based on the UniDic POS tagset [7]. We use a slightly modified version of SUW as the UD word tokenization, in which cardinal numbers are concatenated into one word.

The (base-)phrase-level dependency structures are annotated manually following the BCCWJ-DepPara guidelines [8]. The phrase-level dependency structures are converted into word-level dependency structures by the head rules of the dependency analyser CaboCha [9].
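
The general idea of turning phrase-level dependencies into word-level ones can be sketched as follows. This is a simplified illustration, not the actual conversion: it assumes the head word of each bunsetsu has already been chosen (in the real pipeline that choice comes from the CaboCha head rules [9]), and the function and parameter names are hypothetical.

```python
def bunsetsu_to_word_heads(spans, bunsetsu_parents, head_words):
    """Project bunsetsu-level dependencies onto words.

    spans:            (start, end) word index range of each bunsetsu, end exclusive
    bunsetsu_parents: index of the parent bunsetsu for each bunsetsu, -1 for the root
    head_words:       index of the word chosen as head of each bunsetsu (assumed given)
    Returns the head word index for every word, with -1 marking the root.
    """
    n_words = spans[-1][1]
    word_heads = [None] * n_words
    for b, (start, end) in enumerate(spans):
        for w in range(start, end):
            if w == head_words[b]:
                # The bunsetsu head attaches to the head word of the parent bunsetsu.
                parent = bunsetsu_parents[b]
                word_heads[w] = -1 if parent == -1 else head_words[parent]
            else:
                # Every other word attaches to the head of its own bunsetsu.
                word_heads[w] = head_words[b]
    return word_heads


# Three bunsetsu covering five words; the first two depend on the third (root).
print(bunsetsu_to_word_heads([(0, 2), (2, 4), (4, 5)], [2, 2, -1], [0, 2, 4]))
```

Relation labels are not handled here; assigning them requires additional rules beyond head projection.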

LEMMA field

LEMMA is the base form of conjugated words (verbs, adjectives, and auxiliary verbs) according to the UniDic schema [7].

XPOS field

XPOS is the part-of-speech label for the Short Unit Word (SUW), based on the UniDic POS tagset [7].
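
For readers who want to inspect the LEMMA and XPOS columns directly, here is a minimal sketch that splits one CoNLL-U token line into its ten standard columns. The `Token` class and `parse_token_line` function are hypothetical helpers, and the sample line is illustrative rather than copied from the treebank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One CoNLL-U token line: ten tab-separated columns."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_token_line(line: str) -> Token:
    """Split a CoNLL-U token line into its ten columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return Token(*columns)


# Hypothetical token line; the values are illustrative only.
sample = "1\t食べ\t食べる\tVERB\t動詞-一般\t_\t0\troot\t_\tSpaceAfter=No"
token = parse_token_line(sample)
print(token.lemma, token.xpos)
```

Comment lines (starting with `#`) and blank sentence separators should be filtered out before calling the function.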

MISC field

SpaceAfter: manually annotated to discriminate alphanumeric word tokens
BunsetuPositionType: the role of the token within its bunsetu, assigned by the head rules [9]:
- SEM_HEAD: the head content word
- SYN_HEAD: the head functional word
- CONT: the non-head content word
- FUNC: the non-head functional word
LUWPOS: the part-of-speech label for Long Unit Word (LUW) based on UniDic POS tagset [7].
LUWBILabel: Long Unit Word (LUW) boundary labels [5]
- B: Beginning of LUW
- I: Inside of LUW
UniDicInfo: lemma information based on UniDic [7]. The UniDic lemma normalises not only conjugation forms but also orthographic variants.
- 1 lForm: lexeme reading （語彙素読み）
- 2 lemma: lexeme （語彙素）
- 3 orth: surface orthographic form （書字形出現形）
- 4 pron: surface pronunciation （発音形出現形）
- 5 orthBase: base orthographic form （書字形基本形）
- 6 pronBase: base pronunciation （発音形基本形）
- 7 form: word form （語形）
- 8 formBase: base word form （語形基本形）
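
The MISC attributes above can be read with a few lines of code. The sketch below relies on the CoNLL-U convention that MISC holds `key=value` pairs separated by `|`. Two things are assumptions rather than documented behaviour: that the UniDicInfo sub-fields are comma-separated in the order listed above, and all helper names and sample values.

```python
UNIDIC_KEYS = ("lForm", "lemma", "orth", "pron",
               "orthBase", "pronBase", "form", "formBase")


def parse_misc(misc):
    """Turn a CoNLL-U MISC string into a dict; "_" means the field is empty."""
    if misc == "_":
        return {}
    attributes = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes


def split_unidic_info(value, sep=","):
    """Name the UniDicInfo sub-fields (the separator is an assumption)."""
    return dict(zip(UNIDIC_KEYS, value.split(sep)))


def group_luw(forms, bi_labels):
    """Join SUW forms into LUW strings using the LUWBILabel values B and I."""
    groups = []
    for form, label in zip(forms, bi_labels):
        if label == "B" or not groups:
            groups.append([form])      # start a new long unit word
        else:
            groups[-1].append(form)    # continue the current long unit word
    return ["".join(group) for group in groups]


# Hypothetical MISC value, for illustration only.
misc = parse_misc("SpaceAfter=No|BunsetuPositionType=SEM_HEAD|LUWBILabel=B")
print(misc["BunsetuPositionType"])
print(group_luw(["東京", "都", "に"], ["B", "I", "B"]))
```

If a UniDicInfo value itself contains the separator, the simple `split` above would misalign the fields, so the real files should be checked before relying on it.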

Acknowledgments

The original treebank was provided by:

Adam LaMontagne
Milan Souček
Timo Järvinen
Alessandra Radici

via

Dan Zeman.

The corpus was converted by:

Mai Omura
Yusuke Miyao
Hiroshi Kanayama
Hiroshi Matsuda

through annotation, discussion and validation with

Aya Wakasa
Kayo Yamashita
Masayuki Asahara
Takaaki Tanaka
Yugo Murawaki
Yuji Matsumoto
Kaoru Ito
Taishi Chika
Shinsuke Mori
Sumire Uematsu

License

See file LICENSE.txt

Reference

[1] Tanaka, T., Miyao, Y., Asahara, M., Uematsu, S., Kanayama, H., Mori, S., & Matsumoto, Y. (2016). Universal Dependencies for Japanese. In LREC.

[2] Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M, & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC.

[3] Omura, M., & Asahara, M. (2018). UD-Japanese BCCWJ: Universal Dependencies Annotation for the Balanced Corpus of Contemporary Written Japanese. In UDW 2018.

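[4] Ogura, H., Koiso, H., Fujiike, Y., Miyauchi, S., Konishi, H., & Hara, Y. (2011). 『現代日本語書き言葉均衡コーパス』形態論情報規程集第４版（下） [Balanced Corpus of Contemporary Written Japanese: Morphological Information Manual, 4th Edition (Vol. 2)] (LR-CCG-10-05-02). National Institute for Japanese Language and Linguistics, Tokyo, Japan.

[5] Ogura, H., Koiso, H., Fujiike, Y., Miyauchi, S., Konishi, H., & Hara, Y. (2011). 『現代日本語書き言葉均衡コーパス』形態論情報規程集第４版（上） [Balanced Corpus of Contemporary Written Japanese: Morphological Information Manual, 4th Edition (Vol. 1)] (LR-CCG-10-05-01). National Institute for Japanese Language and Linguistics, Tokyo, Japan.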

[6] Maekawa, K., Yamazaki, M., Ogiso, T., Maruyama, T., Ogura, H., Kashino, W., Koiso, H., Yamaguchi, M., Tanaka, M., & Den, Y. (2014). Balanced Corpus of Contemporary Written Japanese. Language Resources and Evaluation, 48(2):345-371.

[7] Den, Y., Nakamura, J., Ogiso, T., & Ogura, H. (2008). A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation. In LREC 2008. pp.1019-1024.

[8] Asahara, M., & Matsumoto, Y. (2016). BCCWJ-DepPara: A Syntactic Annotation Treebank on the `Balanced Corpus of Contemporary Written Japanese'. In ALR-12.

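[9] Kudo, T., & Matsumoto, Y. (2002). Japanese Dependency Analysis using Cascaded Chunking. In CoNLL 2002. pp.63-69.

[10] Matsuda, H., Wakasa, A., Yamashita, K., Omura, M., & Asahara, M. (2020). UD Japanese GSD の再整備と固有表現情報付与 [Reorganization of UD Japanese GSD and Annotation of Named Entity Information]. In Proceedings of the 26th Annual Meeting of the Association for Natural Language Processing (言語処理学会第26回年次大会発表論文集).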

Changelog

2020-05- v2.6
- Update for v2.6. Introduce the conversion method of UD-Japanese BCCWJ [3]
- Add the files containing the NE gold labels
2019-11-15 v2.5
- Google gave permission to drop the "NC" restriction from the license. This applies to the UD annotations (not the underlying content, of which Google claims no ownership or copyright).
2018-11- v2.3
- Updates for v2.3. Errors in morphologies are fixed, and unknown words and dep labels are reduced. XPOS is added.
2017-11- v2.1
- Updates for v2.1. Several errors are removed by adding PoS/label rules and merging the manual dependency annotations in the original bunsetu-style annotations in Google UDT 2.0.
2017-03-01 v2.0
- Converted to UD v2 guidelines.
2016-11-15 v1.4
- Initial release in Universal Dependencies.

===================================
Universal Dependency Treebanks v2.0
(legacy information)
===================================

=========================
Licenses and terms-of-use
=========================

For the following languages

  German, Spanish, French, Indonesian, Italian, Japanese, Korean and Brazilian
  Portuguese

we will distinguish between two portions of the data.

1. The underlying text for sentences that were annotated. This data Google
   asserts no ownership over and no copyright over. Some or all of these
   sentences may be copyrighted in some jurisdictions.  Where copyrighted,
   Google collected these sentences under exceptions to copyright or implied
   license rights.  GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY
   WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED.

2. The annotations -- part-of-speech tags and dependency annotations. These are
   made available under a CC BY-SA 4.0. GOOGLE MAKES
   THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER
   EXPRESS OR IMPLIED. See attached LICENSE file for the text of CC BY-NC-SA.

Portions of the German data were sampled from the CoNLL 2006 Tiger Treebank
data. Hans Uszkoreit graciously gave permission to use the underlying
sentences in this data as part of this release.

Any use of the data should reference the above plus:

  Universal Dependency Annotation for Multilingual Parsing
  Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg,
  Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang,
  Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
  Proceedings of ACL 2013

=======
Contact
=======

ryanmcd@google.com
joakim.nivre@lingfil.uu.se
slav@google.com

See https://github.com/ryanmcd/uni-dep-tb for more details

=== Machine-readable metadata =================================================
Data available since: UD v1.4
License: CC BY-SA 4.0
Includes text: yes
Genre: news blog
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: not available
Relations: converted from manual
Contributors: Omura, Mai; Miyao, Yusuke; Kanayama, Hiroshi; Matsuda, Hiroshi; Wakasa, Aya; Yamashita, Kayo; Asahara, Masayuki; Tanaka, Takaaki; Murawaki, Yugo; Matsumoto, Yuji; Mori, Shinsuke; Uematsu, Sumire; McDonald, Ryan; Nivre, Joakim; Zeman, Daniel
Contributing: here
Contact: hkana@jp.ibm.com

(Original treebank contributors: LaMontagne, Adam; Souček, Milan; Järvinen, Timo; Radici, Alessandra)

Releases

Latest release: UD Japanese GSD r2.10 with Named Entity Gold Labels (May 29, 2022). Five earlier releases are also available.
