script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset
shihono/evaluate_japanese_w2v
A script for applying Japanese similarity evaluation datasets to a word2vec model.
Supports tokenizing Japanese with mecab-python3 and SudachiPy.
- chardet
- numpy
- scipy
- gensim
- mecab-python3
- sudachipy
- sudachidict-core
$ python eval.py model data [option]
- model: A word2vec model file that can be loaded by gensim.
- data: A CSV file, or a directory containing CSV files. Each file must have three columns: word1, word2, and a numeric value such as a similarity score.
  - The three columns can be specified with --col (default: [0,1,2]).
optional arguments:
  -h, --help            show this help message and exit
  --col COL COL COL     indexes of word1, word2, similarity
  --verbose, -v         verbose
  --mecab, -m           use mecab
  --mecab_dict MECAB_DICT, -d MECAB_DICT
                        mecab dictionary path
  --sudachi, -s         use sudachi
  --sudachi_mode SUDACHI_MODE
                        select sudachi tokenizer mode: A or B or C
  --output OUTPUT, -o OUTPUT
                        output csv path or directory path
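To illustrate what the three `--col` indexes select, here is a minimal sketch of reading word pairs from CSV rows. It is not the code in `eval.py`: the function name `read_pairs`, the sample header, and the sample scores are all hypothetical, and skipping rows whose score is not numeric is an assumption made here to handle a header line.

```python
import csv
import io


def read_pairs(lines, cols=(0, 1, 2)):
    """Return (word1, word2, score) tuples from CSV rows.

    `cols` holds the column indexes of word1, word2 and the score,
    mirroring the three values passed to --col (hypothetical helper).
    """
    w1_col, w2_col, score_col = cols
    pairs = []
    for row in csv.reader(lines):
        if len(row) <= max(cols):
            continue  # row has too few columns for the requested indexes
        try:
            score = float(row[score_col])
        except ValueError:
            continue  # assumption: non-numeric score means a header row
        pairs.append((row[w1_col], row[w2_col], score))
    return pairs


# Made-up sample data with five columns; not taken from any real dataset.
sample = io.StringIO(
    "id,word1,word2,association,similarity\n"
    "1,犬,猫,5.2,3.9\n"
    "2,車,自動車,6.1,5.8\n"
)
pairs = read_pairs(sample, cols=(1, 2, 4))
```

With `cols=(1, 2, 4)` the first and fourth columns are ignored, so each tuple carries the two words and the value from the last column.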
Example for MeCab
- model: Japanese Word2Vec Model Builder
- data: JWSAN (similarity)
- tokenizer (optional): MeCab with mecab-ipadic-neologd
$ python eval.py /path/to/latest-ja-word2vec-gensim-model/word2vec.gensim.model \
    /path/to/JWSAN/jwsan-1400.txt \
    -v --col 1 2 4 -m --mecab_dict /usr/local/lib/mecab/dic/mecab-ipadic-neologd
Output:
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] set logger
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Word vector 50 dim, Vocab size 335476
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Use mecab : dict setting is /usr/local/lib/mecab/dic/mecab-ipadic-neologd
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] load filepath : /path/to/JWSAN/jwsan-1400.csv, 1400 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Evaluate 1359 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] spearmanr SpearmanrResult(correlation=0.4155930561711437, pvalue=6.97399627506598e-58)
Data 1400
OOV 41
Corr 0.416
More results are available in 学習済み日本語word2vecとその評価について (written in Japanese).
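The Data, OOV and Corr figures in the output above can be understood through a small sketch of this kind of evaluation: score each word pair by cosine similarity, skip pairs with an out-of-vocabulary word, and compute Spearman's rank correlation against the gold scores. This is an assumption-laden illustration rather than the code in `eval.py`. The log shows the script using scipy's `spearmanr`; the pure-Python `spearman` below is a stand-in so the example runs without dependencies, and the names `evaluate`, `vectors`, and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, pairs):
    """Count OOV pairs and correlate model similarity with gold scores."""
    gold, predicted, oov = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1  # assumption: pairs with a missing word are skipped
            continue
        gold.append(score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return {"data": len(pairs), "oov": oov, "corr": spearman(gold, predicted)}


# Toy vectors and gold scores, made up for illustration only.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
pairs = [("a", "b", 5.0), ("a", "d", 3.0), ("a", "c", 1.0), ("a", "zzz", 4.0)]
result = evaluate(vectors, pairs)
```

In this toy run one pair is skipped because `zzz` has no vector, which corresponds to how the log above evaluates 1359 of 1400 pairs after 41 OOV cases.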