script to evaluate pre-trained Japanese word2vec model on Japanese similarity dataset


shihono/evaluate_japanese_w2v


A script for evaluating word2vec models on Japanese word-similarity datasets.

Tokenization of Japanese with mecab-python3 and SudachiPy is supported.

Requirements

  • chardet
  • numpy
  • scipy
  • gensim
  • mecab-python3
  • sudachipy
  • sudachidict-core

Usage

$ python eval.py model data [option]

  • model: a word2vec model file that can be loaded by gensim.

  • data: a CSV file, or a directory containing CSV files. Each file has three columns: word1, word2, and a numeric value such as a similarity score.

    • The three columns can be specified with --col (default: [0,1,2]).

optional arguments:
  -h, --help            show this help message and exit
  --col COL COL COL     indexes of word1, word2, similarity
  --verbose, -v         verbose
  --mecab, -m           use mecab
  --mecab_dict MECAB_DICT, -d MECAB_DICT
                        mecab dictionary path
  --sudachi, -s         use sudachi
  --sudachi_mode SUDACHI_MODE
                        select sudachi tokenizer mode: A or B or C
  --output OUTPUT, -o OUTPUT
                        output csv path or directory path
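
The --col option can be pictured as picking three fields out of every CSV row. The following is a minimal sketch of that idea, not the code in eval.py: `parse_pairs` is a hypothetical helper, and both the comma delimiter and the rule of skipping rows whose score field is not numeric (such as a header row) are assumptions made for illustration.

```python
import csv


def parse_pairs(lines, cols=(0, 1, 2)):
    """Extract (word1, word2, score) triples from CSV lines.

    `lines` is any iterable of strings, for example an open file.
    `cols` mirrors the three indexes given to --col.
    Assumption of this sketch: rows that are too short, or whose score
    field cannot be parsed as a float, are silently skipped.
    """
    w1_idx, w2_idx, score_idx = cols
    pairs = []
    for row in csv.reader(lines):
        if len(row) <= max(cols):
            continue  # not enough columns in this row
        try:
            score = float(row[score_idx])
        except ValueError:
            continue  # e.g. a header row
        pairs.append((row[w1_idx], row[w2_idx], score))
    return pairs
```

With a file on disk this could be called as `parse_pairs(open(path, newline="", encoding="utf-8"), cols=(1, 2, 4))`, which corresponds to passing `--col 1 2 4` on the command line. The UTF-8 encoding here is also an assumption.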

Example

Example for Mecab

$ python eval.py /path/to/latest-ja-word2vec-gensim-model/word2vec.gensim.model \
    /path/to/JWSAN/jwsan-1400.txt \
    -v --col 1 2 4 -m --mecab_dict /usr/local/lib/mecab/dic/mecab-ipadic-neologd

Output:

[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] set logger
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Word vector 50 dim, Vocab size 335476
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Use mecab : dict setting is /usr/local/lib/mecab/dic/mecab-ipadic-neologd
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] load filepath : /path/to/JWSAN/jwsan-1400.csv, 1400 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] Evaluate 1359 data
[XXXX-XX-XX XX:XX:XX,XXX] [__main__] [INFO] spearmanr SpearmanrResult(correlation=0.4155930561711437, pvalue=6.97399627506598e-58)
Data    1400
OOV     41
Corr    0.416
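
The log reports 1400 rows, 41 OOV and 1359 evaluated pairs, which suggests that pairs containing an out-of-vocabulary word are left out before the correlation is computed. The following is a rough sketch of that kind of evaluation under stated assumptions, not the script's own code. `vectors` is a hypothetical plain dict standing in for the gensim model, cosine similarity is assumed as the model score, and the MeCab and Sudachi options are not modeled. The `SpearmanrResult` in the log appears to come from scipy; here Spearman's correlation is written out in plain Python only to keep the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, pairs):
    """Return (n_data, n_oov, correlation) for (word1, word2, score) triples.

    Assumption: a pair counts as OOV when either word is missing from
    `vectors`, and OOV pairs are excluded from the correlation.
    """
    gold, predicted, oov = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            oov += 1
            continue
        gold.append(score)
        predicted.append(cosine(vectors[w1], vectors[w2]))
    return len(pairs), oov, spearman(gold, predicted)
```

In this sketch the three returned values play the roles of the Data, OOV and Corr lines of the summary, and the number of evaluated pairs is the first minus the second (1400 - 41 = 1359 in the log).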

More results are available in 学習済み日本語word2vecとその評価について ("Pre-trained Japanese word2vec models and their evaluation", written in Japanese).
