Hironsan/ja.text8

Japanese text8 corpus for word embedding.

hironsan.hatenablog.com/entry/japanese-text8-corpus

ja.text8

ja.text8 is a small (100 MB) text corpus from the web (Japanese Wikipedia).

You can download ja.text8 corpus from the following link:

ja.text8.zip
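Once the archive is downloaded, it has to be unpacked before training. The following is a minimal sketch using only the Python standard library. It assumes the archive was saved locally as `ja.text8.zip` and contains a member named `ja.text8`; both names are assumptions and may need adjusting. `extract_corpus` is a hypothetical helper, not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_corpus(zip_path, dest_dir=".", member="ja.text8"):
    """Extract one member from the archive and return its path.

    `member` is an assumed file name inside the archive.
    """
    with zipfile.ZipFile(zip_path) as zf:
        if member not in zf.namelist():
            raise FileNotFoundError(f"{member!r} not found in {zip_path}")
        # ZipFile.extract returns the path of the extracted file.
        return Path(zf.extract(member, path=dest_dir))


# Example usage (requires the downloaded archive):
# corpus_path = extract_corpus("ja.text8.zip")
```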

Usage

You can train word2vec on ja.text8. After downloading ja.text8, run the following code. Training takes about 2 minutes:

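    import logging
    from gensim.models import word2vec

    # Log training progress to the console.
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # Read the corpus and train 200-dimensional word vectors.
    # Note: gensim 4.x renamed the `size` argument to `vector_size`.
    sentences = word2vec.Text8Corpus('ja.text8')
    model = word2vec.Word2Vec(sentences, size=200)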
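To make it clearer what the training code consumes, here is a rough standard-library sketch of a text8-style reader. It assumes the corpus is plain UTF-8 text with tokens separated by whitespace, and it yields fixed-size lists of tokens that play the role of "sentences". The function name `iter_token_chunks` and the parameters `chunk_tokens` and `block_chars` are hypothetical; this illustrates the idea and is not gensim's implementation.

```python
from typing import Iterator, List


def iter_token_chunks(path: str, chunk_tokens: int = 10000,
                      block_chars: int = 65536) -> Iterator[List[str]]:
    """Yield lists of at most `chunk_tokens` whitespace-separated tokens."""
    buffer: List[str] = []
    leftover = ""  # partial token cut off at a block boundary
    with open(path, encoding="utf-8") as f:
        while True:
            block = f.read(block_chars)
            if not block:
                break
            text = leftover + block
            tokens = text.split()
            # If the block ended in the middle of a token, hold that piece back.
            if tokens and not text[-1].isspace():
                leftover = tokens.pop()
            else:
                leftover = ""
            buffer.extend(tokens)
            while len(buffer) >= chunk_tokens:
                yield buffer[:chunk_tokens]
                buffer = buffer[chunk_tokens:]
    # Flush whatever remains at end of file.
    if leftover:
        buffer.append(leftover)
    if buffer:
        yield buffer


# Example usage (requires the corpus file):
# for chunk in iter_token_chunks("ja.text8"):
#     print(len(chunk), chunk[:5])
#     break
```

Reading in blocks keeps memory use small even though the corpus is about 100 MB.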

After training, you can test the model as follows (in gensim 4.x, call `model.wv.most_similar` instead):

    >>> model.most_similar(['日本'])
    [('中国', 0.598496675491333),
     ('韓国', 0.5914819240570068),
     ('アメリカ', 0.5286925435066223),
     ('英国', 0.5090063810348511),
     ('台湾', 0.4761126637458801),
     ('米国', 0.45954638719558716),
     ('アメリカ合衆国', 0.45181626081466675),
     ('イギリス', 0.44740626215934753),
     ('ソ連', 0.43657147884368896),
     ('海外', 0.4325913190841675)]

Great!
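The scores returned by `most_similar` are cosine similarities between word vectors. As a small illustration of how such a ranking can be computed, here is a standard-library sketch. The helper names and the 2-dimensional toy vectors are invented for this example; they are not taken from the trained model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word: str, vectors: Dict[str, List[float]],
                 topn: int = 10) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "日本": [1.0, 0.2],
    "中国": [0.9, 0.3],
    "りんご": [0.1, 1.0],
}
print(most_similar("日本", toy_vectors, topn=2))
```

With these toy vectors, "中国" ranks above "りんご" because its vector points in nearly the same direction as the one for "日本".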

Requirements

Python 3.x
MeCab
virtualenv

Make corpus by yourself

You can download ja.text8, but you can also build the corpus yourself.

Simply run:

$ ./setup.sh
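The contents of `setup.sh`, `process.py`, and `tokenize.py` are not shown here, so the following is not the repository's pipeline. It is only a rough sketch of the general idea behind a text8-style corpus: tokenize each document, join the tokens with single spaces, and stop once a size budget is reached. The function names, the `max_bytes` parameter, and the stopping rule are assumptions. The optional MeCab helper assumes the `mecab-python3` package and a dictionary are installed.

```python
from typing import Callable, Iterable, List


def build_text8_style_corpus(documents: Iterable[str],
                             tokenize: Callable[[str], List[str]],
                             max_bytes: int = 100 * 1000 * 1000) -> str:
    """Join tokenized documents into one space-separated string,
    stopping before the UTF-8 size would exceed `max_bytes`."""
    pieces: List[str] = []
    used = 0
    for doc in documents:
        for token in tokenize(doc):
            # Each token after the first also costs one separating space.
            cost = len(token.encode("utf-8")) + (1 if pieces else 0)
            if used + cost > max_bytes:
                return " ".join(pieces)
            pieces.append(token)
            used += cost
    return " ".join(pieces)


def make_mecab_tokenizer() -> Callable[[str], List[str]]:
    """Return a tokenizer backed by MeCab (assumes mecab-python3 is installed)."""
    import MeCab  # imported lazily so the sketch loads without MeCab

    tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated surface forms
    return lambda text: tagger.parse(text).split()


# Example usage (requires MeCab):
# corpus = build_text8_style_corpus(["日本の首都は東京です。"], make_mecab_tokenizer())
```

Passing the tokenizer in as a function keeps the size-capping logic independent of MeCab, so it can be tried with any whitespace tokenizer first.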

License

CC-BY-SA
