tmu-nlp/JapaneseWordSimilarityDataset


Japanese Word Similarity Dataset


Folders and files

src
README.ja.md
README.md
score_adj.csv
score_adv.csv
score_noun.csv
score_verb.csv


Data

We built a Japanese word similarity dataset that includes rare words.

It covers four parts of speech: verbs, adjectives, nouns, and adverbs.

Data construction

We constructed our dataset following the Stanford Rare Word Similarity Dataset (RW) proposed by Luong et al. (2013).

We extracted pairs of Japanese verbs (including sahen verbs) and adjectives (both i-adjectives and na-adjectives) from the evaluation dataset for Japanese lexical simplification by Kodaira et al. (2016).

We used a crowdsourcing service (Lancers) to recruit 10 annotators, who rated the similarity of each word pair on an 11-level scale.

0 (most dissimilar) - 10 (most similar)

Entry

A sample of the dataset is shown below:

word1	word2	mean(remove_extreme_annotator)	sub1	sub2	...	sub9	sub10	mean
排除する	無視する	4.6	5	3	...	5	6	4.8
排除する	除外する	6.6	7	6	...	5	7	6.8

mean(remove_extreme_annotator): average of the similarity scores after removing annotators who assigned extreme values

mean: average of the similarity scores assigned by all annotators

sub*: the similarity score assigned by each annotator
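
As a rough illustration of the column layout described above, the following sketch reads rows and recomputes the plain mean from the sub* columns. It is not part of the repository. The comma delimiter, the three-annotator sample, the placeholder words, and the function name `read_pairs` are all assumptions made for this example; the scores are made up and are not taken from the dataset.

```python
import csv
import io
from statistics import mean

# Hypothetical sample in the column layout described above.
# Words and scores are placeholders; the comma delimiter is an assumption.
SAMPLE = """word1,word2,mean(remove_extreme_annotator),sub1,sub2,sub3,mean
wordA,wordB,4.5,5,3,4,4.0
wordC,wordD,6.5,7,6,8,7.0
"""


def read_pairs(handle):
    """Yield (word1, word2, per-annotator scores) from a CSV file handle."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Collect every sub* column, ordered by annotator index.
        sub_keys = sorted(
            (key for key in row if key.startswith("sub")),
            key=lambda key: int(key[3:]),
        )
        scores = [float(row[key]) for key in sub_keys]
        yield row["word1"], row["word2"], scores


pairs = list(read_pairs(io.StringIO(SAMPLE)))
for word1, word2, scores in pairs:
    # Recompute the plain mean over all annotators.
    print(word1, word2, round(mean(scores), 2))
```

To try this on one of the score files, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("score_verb.csv", encoding="utf-8", newline="")`, after checking the delimiter and encoding the file actually uses.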

Helper script in src

The src directory contains a helper script that calculates Spearman's rank correlation coefficient, as used in our LREC paper.

Specifically, we learned word vectors from Japanese Wikipedia and calculated the rank correlation coefficient between the model's similarity for each word pair and the mean of the annotated scores.
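
The evaluation described above can be sketched as follows. This is a minimal standard-library illustration, not the script in src, which may be implemented differently. Cosine similarity as the vector similarity measure, the toy two-dimensional vectors, the placeholder words, and all function names are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy word vectors and annotated means (hypothetical, for illustration only).
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.5, 0.5],
}
gold = [("a", "b", 8.0), ("a", "c", 1.0), ("a", "d", 5.0)]

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in gold]
human_scores = [score for _, _, score in gold]
rho = spearman(model_scores, human_scores)
print(round(rho, 3))
```

Because Spearman's coefficient depends only on rank order, the model's similarity values do not need to be on the same 0 to 10 scale as the annotations.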

License

Our work is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0).

Citation

If you use our dataset, please cite our LREC paper.

Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In 11th edition of the Language Resources and Evaluation Conference (LREC 2018), pp.948-951. May 2018.
Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In arXiv e-prints, 1703.05916 (5 pages). March 2017.

References

Tomonori Kodaira, Tomoyuki Kajiwara, Mamoru Komachi. Controlled and Balanced Dataset for Japanese Lexical Simplification. ACL 2016 Student Research Workshop, pp.1-7. August 2016.
Minh-Thang Luong, Richard Socher, Christopher D. Manning. Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013, pp.104-113. August 2013.

Tokyo Metropolitan University

Yuya Sakaizawa

e-mail: ksf.doingmorewithless-at-gmail.com
