Shuailong/EntityWord2Vec
This repo contains code to train word2vec embeddings while keeping some special tokens. All tokens in the training corpus of the form "DBPEDIA_ID/*" are trained regardless of their frequency. We use Gensim's implementation, which is pythonic and fast on CPU.
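A minimal sketch of how entity tokens can be force-kept through Gensim's trim_rule hook; the function name and prefix check here are illustrative, and the repo's actual training script may differ:

```python
from gensim import utils

def keep_entities(word, count, min_count):
    """Force-keep DBPEDIA_ID/* tokens; apply the normal min_count rule to everything else."""
    if word.startswith('DBPEDIA_ID/'):
        return utils.RULE_KEEP
    return utils.RULE_DEFAULT
```

The callback is passed to gensim.models.Word2Vec through its trim_rule argument when the vocabulary is built, so entity tokens survive pruning even if they occur only once.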
The corpus is 20.8 GB, built from wiki2vec, and contains 11,521,424 entities marked with "DBPEDIA_ID/*". The entity lexicon with frequencies is stored in corpus/en_entity_lexicons.txt (355 MB), sorted by frequency.

Training hyperparameters:
- embedding dimension: 100
- initial learning rate: 0.025
- minimum learning rate: 1e-4
- window size: 5
- min_count: 5
- entity_min_count: 1
- sample: 0.001 (higher-frequency words are downsampled at this rate)
- model: skipgram
- negative sampling noise words: 5
- epochs: 5
- words sorted by frequency
- word batch size: 10000
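Mapped onto Gensim's Word2Vec arguments (3.x API), the configuration above looks roughly like the sketch below; the corpus path and the trim-rule helper are assumptions rather than the repo's exact script:

```python
from gensim import utils
from gensim.models.word2vec import LineSentence, Word2Vec

def keep_entities(word, count, min_count):
    # Force-keep entity tokens, which effectively gives entity_min_count = 1.
    return utils.RULE_KEEP if word.startswith('DBPEDIA_ID/') else utils.RULE_DEFAULT

# Stream the ~20 GB corpus from disk instead of loading it into memory.
sentences = LineSentence('corpus/en_entity_text.txt')  # corpus path is an assumption

model = Word2Vec(
    sentences,
    size=100,          # embedding dimension
    alpha=0.025,       # initial learning rate
    min_alpha=1e-4,    # minimum learning rate
    window=5,
    min_count=5,
    sample=0.001,      # downsampling threshold for frequent words
    sg=1,              # skip-gram
    negative=5,        # negative-sampling noise words
    iter=5,            # epochs
    sorted_vocab=1,    # sort the vocabulary by descending frequency
    batch_words=10000,
    workers=8,         # 8 threads on s2, 32 on ai
    trim_rule=keep_entities,
)

model.wv.save_word2vec_format('word2vec.en_entity_text.100d.ai.txt', binary=False)
```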
Training takes ~9.96 hours on s2 with 8 worker threads (334,461 effective words/s) and ~7.36 hours on ai with 32 worker threads (490,843 effective words/s).
The pretrained embedding files contain a vocabulary of 15,902,725 (~16M) tokens with 100-dimensional vectors. The total file size is ~20 GB.
To validate that the entities are kept, you can run `grep "DBPEDIA_ID/*" word2vec.en_entity_text.100d.ai.txt > out.txt`.
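Alternatively, a quick check can be done in Python by loading the vectors with Gensim's KeyedVectors (3.x API); the entity key used below is a hypothetical example, not necessarily one present in this vocabulary:

```python
from gensim.models import KeyedVectors

# Loading ~16M 100-d vectors from the text file is slow and memory-hungry.
wv = KeyedVectors.load_word2vec_format('word2vec.en_entity_text.100d.ai.txt', binary=False)

print(len(wv.vocab))                                   # expect 15,902,725 tokens
print('DBPEDIA_ID/Barack_Obama' in wv.vocab)           # hypothetical entity key
print(wv.most_similar('DBPEDIA_ID/Barack_Obama', topn=5))
```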
A simple evaluation is the word analogy test ("A is to B as C is to D"): accuracy is 62.7% for the s2 model and 63.2% for the ai model. For comparison, Google's best word2vec result is above 70%.
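Such a score can be reproduced with Gensim's built-in analogy evaluator (gensim ≥ 3.4), assuming the standard Google questions-words.txt file; this is a sketch of the evaluation, not the exact script behind the reported numbers:

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('word2vec.en_entity_text.100d.ai.txt', binary=False)

# questions-words.txt is the Google analogy test set ("A is to B as C is to D").
overall_accuracy, sections = wv.evaluate_word_analogies('questions-words.txt')
print(overall_accuracy)   # ~0.63 for the ai model, per the numbers above
```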
- Gensim Tutorials
- Google Word2Vec