Japanese text8 corpus for word embedding.


Hironsan/ja.text8


ja.text8 is a small (100MB) text corpus from the web (Japanese Wikipedia).

You can download ja.text8 corpus from the following link:

Usage

You can train word2vec on ja.text8. After downloading ja.text8, run the following code. Training takes about 2 minutes:

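import logging

from gensim.models import word2vec

# Log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Text8Corpus streams the whitespace-separated tokens in ja.text8.
sentences = word2vec.Text8Corpus('ja.text8')

# Train 200-dimensional vectors (gensim 4.0+ renamed `size` to `vector_size`).
model = word2vec.Word2Vec(sentences, size=200)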
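The training code above hands the corpus to `Text8Corpus`, which treats the file as one long stream of whitespace-separated tokens and feeds it to word2vec in fixed-size chunks. The following is a rough standard-library sketch of that idea, not gensim's actual implementation. The function name `iter_token_chunks`, its parameters, and the default sizes are assumptions made for illustration.

```python
from typing import Iterator, List, TextIO


def iter_token_chunks(
    stream: TextIO, chunk_size: int = 10000, block_size: int = 8192
) -> Iterator[List[str]]:
    """Yield lists of at most `chunk_size` whitespace-separated tokens.

    Hypothetical helper: it reads `stream` in blocks of `block_size`
    characters, so the whole corpus never has to fit in memory at once.
    """
    pending: List[str] = []  # complete tokens not yet emitted
    tail = ""                # possibly incomplete token from the last block
    while True:
        block = stream.read(block_size)
        if not block:
            break
        block = tail + block
        pieces = block.split()
        if block[-1].isspace():
            tail = ""
        else:
            # The block ended mid-token; hold the last piece back.
            tail = pieces.pop()
        pending.extend(pieces)
        while len(pending) >= chunk_size:
            yield pending[:chunk_size]
            pending = pending[chunk_size:]
    if tail:
        pending.append(tail)
    if pending:
        yield pending
```

Under these assumptions, `for chunk in iter_token_chunks(open('ja.text8', encoding='utf-8')): ...` would walk the corpus one pseudo-sentence at a time, which is a convenient way to inspect the tokens before training.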

After the training, you can test the model as follows:

>>> model.most_similar(['日本'])
[('中国', 0.598496675491333),
 ('韓国', 0.5914819240570068),
 ('アメリカ', 0.5286925435066223),
 ('英国', 0.5090063810348511),
 ('台湾', 0.4761126637458801),
 ('米国', 0.45954638719558716),
 ('アメリカ合衆国', 0.45181626081466675),
 ('イギリス', 0.44740626215934753),
 ('ソ連', 0.43657147884368896),
 ('海外', 0.4325913190841675)]

Great!
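The scores returned by `most_similar` are cosine similarities between word vectors, so values closer to 1 mean the two words appear in more similar contexts. As a minimal sketch of how such a score is computed, here is a plain-Python version. The function name `cosine_similarity` is an illustrative assumption and is not part of gensim.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1. The values around 0.4 to 0.6 in the output above therefore indicate moderately related words.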

Requirements

  • Python 3.x
  • MeCab
  • virtualenv

Make corpus by yourself

You can download ja.text8, but you can also build the corpus yourself.

Simply run:

$ ./setup.sh

License

CC-BY-SA
