tmu-nlp/JapaneseWordSimilarityDataset
We built a Japanese word similarity dataset that includes rare words.
We target verbs, adjectives, nouns, and adverbs.
We constructed our dataset following the Stanford Rare Word Similarity Dataset (RW) proposed by Luong et al. (2013).
We extracted pairs of Japanese verbs (including sahen verbs) and adjectives (both i-adjectives and na-adjectives) from Kodaira et al. (2016)'s evaluation dataset for Japanese lexical simplification.
We employed a crowdsourcing service (Lancers) to recruit 10 annotators, who rated the similarity of each word pair on an 11-level scale.
0 (most dissimilar) - 10 (most similar)
A sample of the dataset is as follows:
word1 | word2 | mean(remove_extreme_annotator) | sub1 | sub2 | ... | sub9 | sub10 | mean |
---|---|---|---|---|---|---|---|---|
排除する | 無視する | 4.6 | 5 | 3 | ... | 5 | 6 | 4.8 |
排除する | 除外する | 6.6 | 7 | 6 | ... | 5 | 7 | 6.8 |
mean(remove_extreme_annotator) : average of the similarity scores assigned by the annotators, excluding annotators who assigned extreme values
mean : average of the similarity scores assigned by all annotators
sub* : the similarity score assigned by each annotator
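The relationship between the `sub*` columns and the two mean columns can be sketched as follows. This is only an illustration: the scores and pair names are made up (they are not taken from the dataset), only four annotators are shown, and the rule for deciding which annotator is "extreme" is an assumption, since the exact criterion is not described here. The sketch drops the single annotator whose scores lie, on average, farthest from the mean of the other annotators.

```python
from statistics import mean

# Hypothetical scores, NOT from the dataset: one list per word pair,
# one entry per annotator (sub1..sub4 here instead of sub1..sub10).
scores = {
    ("pair_a_word1", "pair_a_word2"): [5, 3, 5, 10],
    ("pair_b_word1", "pair_b_word2"): [7, 6, 5, 0],
}


def plain_means(scores):
    """Average over all annotators (the 'mean' column)."""
    return {pair: mean(vals) for pair, vals in scores.items()}


def annotator_deviation(scores, idx):
    """Average absolute distance between annotator idx and the mean of the others."""
    deviations = []
    for vals in scores.values():
        others = [v for i, v in enumerate(vals) if i != idx]
        deviations.append(abs(vals[idx] - mean(others)))
    return mean(deviations)


def most_extreme_annotator(scores):
    """Index of the annotator with the largest deviation (assumed criterion)."""
    n_annotators = len(next(iter(scores.values())))
    return max(range(n_annotators), key=lambda i: annotator_deviation(scores, i))


def means_without(scores, idx):
    """Average over all annotators except idx."""
    return {
        pair: mean(v for i, v in enumerate(vals) if i != idx)
        for pair, vals in scores.items()
    }


plain = plain_means(scores)
extreme = most_extreme_annotator(scores)
trimmed = means_without(scores, extreme)

for pair in scores:
    print(pair, "mean:", plain[pair], "mean without extreme annotator:", trimmed[pair])
```

With real data, each list would hold the ten `sub1` to `sub10` values of a row.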
The src directory contains a helper script to calculate the Spearman's rank correlation coefficient used in our LREC paper.
Specifically, we learned word vectors from Japanese Wikipedia and calculated the rank correlation coefficient between the vector-based similarity of the word pairs and the mean of the annotated scores.
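The evaluation described above can be sketched in a few lines. This is not the script in the src directory. It is a minimal standard-library illustration that assumes cosine similarity as the vector similarity, and it uses tiny made-up vectors, placeholder words, and made-up gold scores. All function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical word vectors and gold scores, NOT taken from the dataset.
vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.5, 0.5],
    "w4": [0.0, 1.0],
}
gold = [("w1", "w2", 8.0), ("w1", "w3", 5.0), ("w1", "w4", 1.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in gold]
gold_scores = [score for _, _, score in gold]
rho = spearman(model_scores, gold_scores)
print(f"Spearman's rho: {rho:.3f}")
```

In practice the vectors would come from a trained embedding model and the gold scores from one of the dataset's mean columns. Word pairs containing a word with no vector need a policy of their own (for example, skipping them), which this sketch does not cover.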
Our work is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0).
If you use our dataset, please cite our LREC paper.
- Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In 11th edition of the Language Resources and Evaluation Conference (LREC 2018), pp.948-951. May 2018.
- Yuya Sakaizawa and Mamoru Komachi. Construction of a Japanese Word Similarity Dataset. In arXiv e-prints, 1703.05916 (5 pages). March 2017.
- Tomonori Kodaira, Tomoyuki Kajiwara, Mamoru Komachi. Controlled and Balanced Dataset for Japanese Lexical Simplification. ACL 2016 Student Research Workshop, pp.1-7. August 2016.
- Minh-thang Luong, Richard Socher, Christopher D. Manning. Better Word Representations with Recursive Neural Networks for Morphology. CoNLL 2013, pp.104-113. August 2013.
Tokyo Metropolitan University
Yuya Sakaizawa
e-mail: ksf.doingmorewithless-at-gmail.com