Masao-Taketani/japanese_text_classification
While trying various DNN text classification methods on a Japanese corpus, the livedoor news corpus, I aim to gain some knowledge and experience of DNNs for NLP.
The two tokenizers below are experimented with.
[1] MeCab + mecab-ipadic-NEologd
Since the program is implemented in Python, mecab-python3 is also required to run it.
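For reference, here is a minimal sketch of tokenizing Japanese text with mecab-python3 and the NEologd dictionary; the dictionary path below is an assumption and depends on where mecab-ipadic-NEologd is installed on your system.

```python
# A minimal sketch of MeCab + mecab-ipadic-NEologd tokenization via
# mecab-python3. The dictionary path is an assumption; adjust it to
# your local NEologd install location.
import MeCab

# -d points MeCab at the NEologd dictionary; -Owakati outputs
# space-separated surface forms (wakati-gaki).
tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")

text = "ライブドアニュースのコーパスを分類します。"
tokens = tagger.parse(text).strip().split()
print(tokens)
```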
[2] SentencePiece
SentencePiece is trained on the Japanese Wikipedia data dumps.
To train it, I referred to the webpage titled "Wikipediaから日本語コーパスを利用してSentencePieceでトークナイズ(分かち書き)" ("Tokenizing (word segmentation) with SentencePiece using a Japanese corpus from Wikipedia").
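A minimal sketch of the SentencePiece training and tokenization steps, assuming the Wikipedia dump has already been extracted to a plain-text file; the file name wiki_ja.txt is hypothetical, and the 8,000-word vocabulary matches the size used in the BERT experiment below, chosen here for illustration.

```python
# A minimal sketch of training a SentencePiece model on an extracted
# Wikipedia text file and tokenizing with it. wiki_ja.txt is a
# hypothetical path; vocab_size is illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_ja.txt",      # one sentence per line
    model_prefix="wiki_ja",   # writes wiki_ja.model / wiki_ja.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="wiki_ja.model")
print(sp.encode("ライブドアニュースのコーパス", out_type=str))
```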
- For case [1] above, pretrained fastText embeddings (Japanese) are used, which can be found at https://drive.google.com/open?id=0ByFQ96A4DgSPUm9wVWRLdm5qbmc (see the sketch after this list).
- For case [2] above, a word embedding matrix is trained jointly with each end-to-end DNN model.
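A minimal sketch of loading the pretrained fastText vectors into an embedding matrix for case [1]; the file name model.vec, the helper build_embedding_matrix, and the 300-dimensional vector size are assumptions about the downloaded archive, not details from this repository.

```python
# A minimal sketch of loading pretrained Japanese fastText vectors and
# building an embedding matrix for a downstream model. model.vec is an
# assumed file name from the downloaded archive.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("model.vec", binary=False)

def build_embedding_matrix(word_index, dim=300):
    """word_index is a hypothetical {token: id} vocabulary built from
    the tokenized livedoor corpus; unseen tokens keep zero vectors."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```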
For BERT, I used bert-japanese implemented by yoheikikuta. Instead of using his trained SentencePiece model and pretrained BERT model, I trained them from scratch. The only changes are listed below.
- I trained SentencePiece with a vocabulary of 8,000 words instead of 32,000.
- I used a newer Japanese Wikipedia dataset than the one he used.
- I pretrained the BERT model for 1,300,000 steps instead of 1,400,000. The pretraining result is shown below.
***** Eval results *****
global_step = 1300000
loss = 1.3378378
masked_lm_accuracy = 0.71464056
masked_lm_loss = 1.2572908
next_sentence_accuracy = 0.97375
next_sentence_loss = 0.08065516
[1] MeCab + mecab-ipadic-NEologd + fastText
- MLP (Multi-Layer Perceptron)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.721 | 0.886 | 0.795 | 175 |
| it-life-hack | 0.789 | 0.825 | 0.806 | 154 |
| kaden-channel | 0.856 | 0.856 | 0.856 | 167 |
| livedoor-homme | 0.820 | 0.439 | 0.571 | 114 |
| movie-enter | 0.848 | 0.931 | 0.888 | 174 |
| peachy | 0.801 | 0.701 | 0.748 | 184 |
| smax | 0.930 | 0.930 | 0.930 | 186 |
| sports-watch | 0.980 | 0.890 | 0.932 | 163 |
| topic-news | 0.832 | 0.975 | 0.897 | 157 |
| micro avg | 0.839 | 0.839 | 0.839 | 1474 |
| macro avg | 0.842 | 0.826 | 0.825 | 1474 |
| weighted avg | 0.843 | 0.839 | 0.834 | 1474 |
- CNN
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.935 | 0.909 | 0.922 | 175 |
| it-life-hack | 0.906 | 1.000 | 0.951 | 154 |
| kaden-channel | 1.000 | 0.970 | 0.985 | 167 |
| livedoor-homme | 0.968 | 0.798 | 0.875 | 114 |
| movie-enter | 0.939 | 0.977 | 0.958 | 174 |
| peachy | 0.900 | 0.929 | 0.914 | 184 |
| smax | 0.984 | 0.978 | 0.981 | 186 |
| sports-watch | 0.994 | 1.000 | 0.997 | 163 |
| topic-news | 0.981 | 0.987 | 0.984 | 157 |
| micro avg | 0.955 | 0.955 | 0.955 | 1474 |
| macro avg | 0.956 | 0.950 | 0.952 | 1474 |
| weighted avg | 0.956 | 0.955 | 0.954 | 1474 |
- BiLSTM
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.850 | 0.874 | 0.862 | 175 |
| it-life-hack | 0.851 | 0.929 | 0.888 | 154 |
| kaden-channel | 0.957 | 0.928 | 0.942 | 167 |
| livedoor-homme | 0.786 | 0.675 | 0.726 | 114 |
| movie-enter | 0.860 | 0.989 | 0.920 | 174 |
| peachy | 0.886 | 0.761 | 0.819 | 184 |
| smax | 0.957 | 0.957 | 0.957 | 186 |
| sports-watch | 0.969 | 0.957 | 0.963 | 163 |
| topic-news | 0.950 | 0.975 | 0.962 | 157 |
| micro avg | 0.900 | 0.900 | 0.900 | 1474 |
| macro avg | 0.896 | 0.894 | 0.893 | 1474 |
| weighted avg | 0.900 | 0.900 | 0.899 | 1474 |
[2] SentencePiece
- MLP
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.862 | 0.926 | 0.893 | 175 |
| it-life-hack | 0.911 | 0.929 | 0.920 | 154 |
| kaden-channel | 0.931 | 0.970 | 0.950 | 167 |
| livedoor-homme | 0.886 | 0.684 | 0.772 | 114 |
| movie-enter | 0.945 | 0.983 | 0.963 | 174 |
| peachy | 0.893 | 0.864 | 0.878 | 184 |
| smax | 0.974 | 0.995 | 0.984 | 186 |
| sports-watch | 0.974 | 0.914 | 0.943 | 163 |
| topic-news | 0.909 | 0.955 | 0.932 | 157 |
| micro avg | 0.922 | 0.922 | 0.922 | 1474 |
| macro avg | 0.921 | 0.913 | 0.915 | 1474 |
| weighted avg | 0.922 | 0.922 | 0.920 | 1474 |
- CNN
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.965 | 0.937 | 0.951 | 175 |
| it-life-hack | 0.962 | 0.994 | 0.978 | 154 |
| kaden-channel | 1.000 | 0.982 | 0.991 | 167 |
| livedoor-homme | 0.903 | 0.816 | 0.857 | 114 |
| movie-enter | 0.956 | 0.994 | 0.975 | 174 |
| peachy | 0.937 | 0.962 | 0.949 | 184 |
| smax | 0.989 | 1.000 | 0.995 | 186 |
| sports-watch | 0.975 | 0.975 | 0.975 | 163 |
| topic-news | 0.981 | 0.981 | 0.981 | 157 |
| micro avg | 0.965 | 0.965 | 0.965 | 1474 |
| macro avg | 0.963 | 0.960 | 0.961 | 1474 |
| weighted avg | 0.965 | 0.965 | 0.965 | 1474 |
- BiLSTM
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.927 | 0.943 | 0.935 | 175 |
| it-life-hack | 0.936 | 0.955 | 0.945 | 154 |
| kaden-channel | 0.970 | 0.964 | 0.967 | 167 |
| livedoor-homme | 0.930 | 0.702 | 0.800 | 114 |
| movie-enter | 0.919 | 0.983 | 0.950 | 174 |
| peachy | 0.891 | 0.935 | 0.912 | 184 |
| smax | 0.969 | 0.995 | 0.981 | 186 |
| sports-watch | 0.969 | 0.957 | 0.963 | 163 |
| topic-news | 0.955 | 0.949 | 0.952 | 157 |
| micro avg | 0.940 | 0.940 | 0.940 | 1474 |
| macro avg | 0.941 | 0.931 | 0.934 | 1474 |
| weighted avg | 0.941 | 0.940 | 0.939 | 1474 |
- BERT
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.958 | 0.920 | 0.939 | 175 |
| it-life-hack | 0.933 | 0.987 | 0.959 | 154 |
| kaden-channel | 0.976 | 0.964 | 0.970 | 167 |
| livedoor-homme | 0.922 | 0.825 | 0.870 | 114 |
| movie-enter | 0.944 | 0.977 | 0.960 | 174 |
| peachy | 0.922 | 0.967 | 0.944 | 184 |
| smax | 0.989 | 0.973 | 0.981 | 186 |
| sports-watch | 1.000 | 0.982 | 0.991 | 163 |
| topic-news | 0.969 | 0.987 | 0.978 | 157 |
| accuracy | | | 0.958 | 1474 |
| macro avg | 0.957 | 0.954 | 0.955 | 1474 |
| weighted avg | 0.958 | 0.958 | 0.958 | 1474 |
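For reference, per-class tables like the ones above can be produced with scikit-learn's classification_report; y_true and y_pred below are tiny dummy arrays standing in for the real test-set labels and predictions.

```python
# A minimal sketch of generating a per-class report for the nine
# livedoor categories. The dummy y_true/y_pred just demonstrate the
# call; real values come from a trained classifier's test predictions.
from sklearn.metrics import classification_report

class_names = ["dokujo-tsushin", "it-life-hack", "kaden-channel",
               "livedoor-homme", "movie-enter", "peachy", "smax",
               "sports-watch", "topic-news"]

y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 0]

print(classification_report(y_true, y_pred,
                            labels=list(range(9)),
                            target_names=class_names, digits=3))
```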
The best model among the seven above is the CNN with SentencePiece.
Results may differ for more complicated classification tasks.
For each DNN model tested with both tokenizers (MLP, CNN, and BiLSTM), the model that used SentencePiece outperformed the one that used MeCab + mecab-ipadic-NEologd + fastText.
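For illustration, here is a minimal Keras sketch of a text CNN along the lines of the best-performing model; the filter widths, embedding dimension, and sequence length are assumptions chosen for the example, not hyperparameters taken from this repository.

```python
# A minimal Keras text-CNN sketch: embed token ids, convolve over
# token windows of several widths, max-pool, and classify into the
# nine livedoor categories. All sizes below are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 8000   # SentencePiece vocabulary size used above
MAX_LEN = 256       # assumed maximum sequence length
NUM_CLASSES = 9     # livedoor corpus categories

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)

# One branch per filter width, each followed by global max-pooling.
pooled = []
for width in (3, 4, 5):
    c = layers.Conv1D(64, width, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))

x = layers.Concatenate()(pooled)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```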