Yet Another Japanese-Wikipedia Entity Vectors

wikiwikification/jawikivec

Overview

Distributed vector representations of words and Wikipedia entities that appear in Japanese Wikipedia. The data is created following the specification of "Japanese-Wikipedia Entity Vectors (in Japanese)" released by Masatoshi Suzuki of the Inui-Suzuki Laboratory at Tohoku University.

Basic Usage

Here are examples using the Gensim library. Segmented words are registered as word2vec words.

>>> from gensim.models import KeyedVectors
>>> w2v_model = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True, unicode_errors='ignore')
>>> print(w2v_model.most_similar(['こと']))
[('事', 0.9618349075317383), ('もの', 0.7732754349708557), ('ため', 0.7425500154495239), ... ]

Entities that link to Wikipedia articles are registered as word2vec words enclosed in square brackets, in the format [entity].

>>> from gensim.models import KeyedVectors
>>> w2v_model = KeyedVectors.load_word2vec_format('entity_vector.model.bin', binary=True, unicode_errors='ignore')
>>> print(w2v_model.most_similar(['[72時間ホンネテレビ]']))
[('[AbemaPrime]', 0.6929168701171875), ('[原宿AbemaNews]', 0.6854027509689331), ...]

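Because words and entities share one vocabulary, it can be handy to tell them apart by the bracket convention. The following is a minimal sketch in plain Python. The helper names `entity_key`, `is_entity_key`, and `split_vocab` are hypothetical and belong neither to Gensim nor to the distributed files, and the bracket test is only a heuristic, since an ordinary token could in principle also start and end with brackets.

```python
def entity_key(title):
    """Wrap a Wikipedia article title in square brackets, e.g. 'X' -> '[X]'."""
    return f"[{title}]"


def is_entity_key(key):
    """Return True if a vocabulary key follows the [entity] convention."""
    return len(key) > 2 and key.startswith("[") and key.endswith("]")


def split_vocab(keys):
    """Partition keys into (words, entities); brackets are stripped from entities."""
    words, entities = [], []
    for key in keys:
        if is_entity_key(key):
            entities.append(key[1:-1])
        else:
            words.append(key)
    return words, entities


# Toy vocabulary built from keys that appear in the examples above.
words, entities = split_vocab(["こと", "[AbemaPrime]", "もの", "[72時間ホンネテレビ]"])
print(words)     # ['こと', 'もの']
print(entities)  # ['AbemaPrime', '72時間ホンネテレビ']
```

With a loaded model, the key list would typically come from `w2v_model.index_to_key` (Gensim 4) and the result of `entity_key(...)` could be passed to `most_similar`.
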
Download

Due to the large file size, files are uploaded to Dropbox.

https://www.dropbox.com/sh/601gucye55nr1gq/AABekRrz4IYtp2n0_lTrKsGma

Dictionary: mecab-ipadic

| File | jawikicorpus | Dictionary | md5 |
|---|---|---|---|
| jawikivec.ipadic.20181120.tar.xz | jawikicorpus.20181120 | mecab-ipadic-2.7.0-20070801 | bc370d107f9076f9abbfd70ab74b1972 |
| jawikivec.ipadic.20181101.tar.xz | jawikicorpus.20181101 | mecab-ipadic-2.7.0-20070801 | ce2b0a197555021e5c0aac96e428c08c |
| jawikivec.ipadic.20181020.tar.xz | jawikicorpus.20181020 | mecab-ipadic-2.7.0-20070801 | 2524636714d1418cba5ff0cbf1947c50 |
| jawikivec.ipadic.20181001.tar.xz | jawikicorpus.20181001 | mecab-ipadic-2.7.0-20070801 | 693a9d75b936c9a2cb25147575f51eea |
| jawikivec.ipadic.20180920.tar.xz | jawikicorpus.20180920 | mecab-ipadic-2.7.0-20070801 | ae63c1cb0c64382773ddfc823c0fce10 |
| jawikivec.ipadic.20180901.tar.xz | jawikicorpus.20180901 | mecab-ipadic-2.7.0-20070801 | 0a55a6a33e8e79151f7347378f70e5b5 |
| jawikivec.ipadic.20180820.tar.xz | jawikicorpus.20180820 | mecab-ipadic-2.7.0-20070801 | cc524c551cccf8fae29b086add0252b5 |
| jawikivec.ipadic.20180720.tar.xz | jawikicorpus.20180720 | mecab-ipadic-2.7.0-20070801 | b3841ad1b46a024b403ed384609d4aad |
| jawikivec.ipadic.20180701.tar.xz | jawikicorpus.20180701 | mecab-ipadic-2.7.0-20070801 | 65ee15ad182adf96cfc722b55c17b9ea |
| jawikivec.ipadic.20180620.tar.xz | jawikicorpus.20180620 | mecab-ipadic-2.7.0-20070801 | ac7afc5daaf15080b0beb3985281636b |
| jawikivec.ipadic.20180601.tar.xz | jawikicorpus.20180601 | mecab-ipadic-2.7.0-20070801 | a72e03aec91be9c287678ea7f3e17527 |
| jawikivec.ipadic.20180520.tar.xz | jawikicorpus.20180520 | mecab-ipadic-2.7.0-20070801 | 898b2562d6b851b84e4b467b92e5782a |

| File | jawikicorpus | Dictionary | md5 |
|---|---|---|---|
| jawikivec.ipadic-neologd.20181120.tar.xz | jawikicorpus.20181120 | mecab-ipadic-NEologd, b3f3ac6fbdb5130894243c40726a9c4878075649 | f8d04ec98699a88c215601b3c86017e4 |
| jawikivec.ipadic-neologd.20181101.tar.xz | jawikicorpus.20181101 | mecab-ipadic-NEologd, b3f3ac6fbdb5130894243c40726a9c4878075649 | 315a1ca2d0fee5d302ceef8cebbd2fe5 |
| jawikivec.ipadic-neologd.20181020.tar.xz | jawikicorpus.20181020 | mecab-ipadic-NEologd, b3f3ac6fbdb5130894243c40726a9c4878075649 | 4b4713493a7ffd8e1104bc393e0b3344 |
| jawikivec.ipadic-neologd.20181001.tar.xz | jawikicorpus.20181001 | mecab-ipadic-NEologd, 1e9da37787c202f157e59d4c9b19cd4636d8a60d | 03c0536e5e68310f8ce40559728a7c06 |
| jawikivec.ipadic-neologd.20180920.tar.xz | jawikicorpus.20180920 | mecab-ipadic-NEologd, 3326dc5bb7467b51e7875f0f332cef6d89049617 | 2d8e0a4e38dc31f073eb97a32d14e684 |
| jawikivec.ipadic-neologd.20180901.tar.xz | jawikicorpus.20180901 | mecab-ipadic-NEologd, 3326dc5bb7467b51e7875f0f332cef6d89049617 | 084942f0153444c5e56ff76db81706dd |
| jawikivec.ipadic-neologd.20180820.tar.xz | jawikicorpus.20180820 | mecab-ipadic-NEologd, 5dc3499bc3fcd28eed960ed03cd51765c5330fe2 | 8d361239c9ec57df78b1f2d527029f44 |
| jawikivec.ipadic-neologd.20180720.tar.xz | jawikicorpus.20180720 | mecab-ipadic-NEologd, 172cfaa0aad1375d53879d273426cefe4a322e98 | 1587854da8d6efb742117d9e2933ab02 |
| jawikivec.ipadic-neologd.20180701.tar.xz | jawikicorpus.20180701 | mecab-ipadic-NEologd, f4d27e2d50c5980a375d326fd8f0e95c881ed1ca | a6c996ab30adbf924270fcb3f292268e |
| jawikivec.ipadic-neologd.20180620.tar.xz | jawikicorpus.20180620 | mecab-ipadic-NEologd, 1c6e9eb600bba348fa772e218b8ce57d4ce70d85 | 1431b93833a8431689fa2b8eef5d45c4 |
| jawikivec.ipadic-neologd.20180601.tar.xz | jawikicorpus.20180601 | mecab-ipadic-NEologd, 3f6f113bc2b7b9eecbce45103a628ba715af3b33 | 2c88c8685ad9a821ffdc0ea833475e9a |
| jawikivec.ipadic-neologd.20180520.tar.xz | jawikicorpus.20180520 | mecab-ipadic-NEologd, b8b282537589becf7256e74c80c543aa2eba5674 | 9d67c83dfe2ceb79bb3ac446a42ede40 |
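
The md5 values listed above can be used to check a downloaded archive before extracting it. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_md5` are hypothetical, and the archive is assumed to sit in the current directory under its listed file name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with a listed value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example call (requires the downloaded archive to be present):
# matches_md5("jawikivec.ipadic.20181120.tar.xz",
#             "bc370d107f9076f9abbfd70ab74b1972")
```

A `False` result would suggest an incomplete or corrupted download, in which case fetching the archive again is the usual remedy.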

Files

Extracting an archive with the following tar command creates five files.

tar xvJf jawikivec.[dictionary].yyyyMMdd.tar.xz

entity_vector.model.bin

An output file saved in binary word2vec format.

entity_vector.model.txt

An output file saved in text word2vec format.
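
The text word2vec format can also be read without Gensim. It starts with a header line giving the vocabulary size and the vector dimension, followed by one line per token with its vector components. The sketch below assumes that layout. The function name `iter_word2vec_text` is hypothetical and the sample vectors are made up for illustration, not taken from the model.

```python
import io


def iter_word2vec_text(lines):
    """Yield (token, vector) pairs from lines in text word2vec format.

    The first line is the header "<vocab_size> <dimensions>"; every later line
    is a token followed by <dimensions> space-separated floats.
    """
    lines = iter(lines)
    header = next(lines, "").split()
    if len(header) != 2:
        raise ValueError("missing '<vocab_size> <dimensions>' header")
    dim = int(header[1])
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so a token containing spaces stays intact.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"malformed line: {line[:40]!r}")
        yield parts[0], [float(x) for x in parts[1:]]


# Tiny in-memory sample with made-up 3-dimensional vectors.
sample = io.StringIO("2 3\nこと 0.1 0.2 0.3\n[AbemaPrime] -0.5 0.0 1.5 \n")
for token, vector in iter_word2vec_text(sample):
    print(token, vector)
```

For the real file, one would pass an open file object instead, for example `open('entity_vector.model.txt', encoding='utf-8', errors='ignore')`, mirroring the `unicode_errors='ignore'` setting used in the Gensim examples.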

entities.tsv

A TSV file listing terms that appear in the plain text together with their corresponding Wikipedia entities. More details are described in Japanese-Wikipedia Wikification Corpus.

version.yml

A YAML-formatted file that stores version information for the dictionary and corpus used.

LICENSE.md

Document regarding licensing.
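
As an alternative to the tar command, the archive can be extracted from Python and checked against the five file names listed above. This is a sketch under assumptions: the function name `extract_archive` is hypothetical, and because the directory layout inside the archive is not documented here, the check searches recursively by file name.

```python
import tarfile
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "entity_vector.model.bin",
    "entity_vector.model.txt",
    "entities.tsv",
    "version.yml",
    "LICENSE.md",
)


def extract_archive(archive_path, dest_dir):
    """Extract an xz-compressed tar archive; return expected names not found."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(dest)
    # Files may sit at the top level or in a subdirectory (layout is assumed),
    # so collect every extracted file name regardless of depth.
    found = {p.name for p in dest.rglob("*") if p.is_file()}
    return [name for name in EXPECTED_FILES if name not in found]


# Example call (requires a downloaded archive):
# missing = extract_archive("jawikivec.ipadic.20181120.tar.xz", "jawikivec")
```

An empty return value means all five expected files were found after extraction.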

Supplementary

Word2vec options

The distributed vector representations are created with the following settings.

| Option | Value |
|---|---|
| -size | 200 |
| -window | 5 |
| -sample | 1e-3 |
| -negative | 5 |
| -hs | 0 |
| -iter | 5 |
| -min-count | 5 |
| -cbow | 1 |
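
To make the settings concrete, the sketch below assembles them into a command line for the original word2vec command-line tool, whose flag names match the options listed above, and shows the roughly equivalent keyword arguments of Gensim 4's `Word2Vec`. The source does not state which tool or invocation produced the vectors, so the executable name `word2vec`, the file names, and the helper `build_word2vec_command` are assumptions for illustration only.

```python
import shlex

# Options listed above, keyed by the word2vec command-line flag names.
WORD2VEC_OPTIONS = {
    "-size": "200",
    "-window": "5",
    "-sample": "1e-3",
    "-negative": "5",
    "-hs": "0",
    "-iter": "5",
    "-min-count": "5",
    "-cbow": "1",
}

# Roughly equivalent gensim.models.Word2Vec keyword arguments (Gensim 4 names).
# "-cbow 1" selects CBOW, which corresponds to sg=0.
GENSIM_EQUIVALENT = {
    "vector_size": 200,
    "window": 5,
    "sample": 1e-3,
    "negative": 5,
    "hs": 0,
    "epochs": 5,
    "min_count": 5,
    "sg": 0,
}


def build_word2vec_command(corpus_path, output_path, binary=True):
    """Return an argv list for the word2vec tool using the options above."""
    cmd = ["word2vec", "-train", str(corpus_path), "-output", str(output_path)]
    for flag, value in WORD2VEC_OPTIONS.items():
        cmd += [flag, value]
    cmd += ["-binary", "1" if binary else "0"]
    return cmd


# Hypothetical corpus and output file names.
print(shlex.join(build_word2vec_command("corpus.txt", "entity_vector.model.bin")))
```

Keeping the options in one dictionary makes it easy to compare a retraining run against the published settings.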
