Masao-Taketani/japanese_text_classification
While trying various DNN text classification methods on a Japanese corpus, the livedoor news corpus, I aim to gain some knowledge and experience of DNNs for NLP.
The two tokenizers below are experimented with.
[1] MeCab + mecab-ipadic-NEologd
Since the program is implemented in Python, mecab-python3 is also required to run it.
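For reference, here is a minimal sketch of tokenizing Japanese text with mecab-python3 and the NEologd dictionary; the dictionary path below is an assumption and depends on where mecab-ipadic-NEologd is installed on your system.

```python
# A minimal sketch of MeCab + mecab-ipadic-NEologd tokenization via
# mecab-python3. The dictionary path is an assumption; adjust it to
# your local NEologd install location.
import MeCab

# -d points MeCab at the NEologd dictionary; -Owakati outputs
# space-separated surface forms (wakati-gaki).
tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")

text = "ライブドアニュースのコーパスを分類します。"
tokens = tagger.parse(text).strip().split()
print(tokens)
```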
[2] SentencePiece
SentencePiece is trained on the Japanese Wikipedia data dumps.
To train it, I referred to the webpage titled "Wikipediaから日本語コーパスを利用してSentencePieceでトークナイズ(分かち書き)" ("Tokenizing (word segmentation) with SentencePiece using a Japanese corpus from Wikipedia").
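A minimal sketch of the SentencePiece training and tokenization steps, assuming the Wikipedia dump has already been extracted to a plain-text file; the file name wiki_ja.txt is hypothetical, and the 8,000-word vocabulary matches the size used in the BERT experiment below, chosen here for illustration.

```python
# A minimal sketch of training a SentencePiece model on an extracted
# Wikipedia text file and tokenizing with it. wiki_ja.txt is a
# hypothetical path; vocab_size is illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wiki_ja.txt",      # one sentence per line
    model_prefix="wiki_ja",   # writes wiki_ja.model / wiki_ja.vocab
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="wiki_ja.model")
print(sp.encode("ライブドアニュースのコーパス", out_type=str))
```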
- For case [1] above, pretrained fastText embeddings (Japanese) are used, which can be found at https://drive.google.com/open?id=0ByFQ96A4DgSPUm9wVWRLdm5qbmc (see the sketch after this list).
- For case [2] above, a word embedding matrix is trained jointly with each end-to-end DNN model.
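A minimal sketch of loading the pretrained fastText vectors into an embedding matrix for case [1]; the file name model.vec, the helper build_embedding_matrix, and the 300-dimensional vector size are assumptions about the downloaded archive, not details from this repository.

```python
# A minimal sketch of loading pretrained Japanese fastText vectors and
# building an embedding matrix for a downstream model. model.vec is an
# assumed file name from the downloaded archive.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("model.vec", binary=False)

def build_embedding_matrix(word_index, dim=300):
    """word_index is a hypothetical {token: id} vocabulary built from
    the tokenized livedoor corpus; unseen tokens keep zero vectors."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```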
For BERT, I used bert-japanese implemented by yoheikikuta. Instead of using his trained SentencePiece model and pretrained BERT model, I trained them from scratch. The only changes are listed below.
- I trained SentencePiece with a vocabulary of 8,000 words instead of 32,000.
- I used a newer Japanese Wikipedia dataset than the one he used.
- I pretrained the BERT model for 1,300,000 steps instead of 1,400,000. The pretraining result is shown below.
***** Eval results *****
global_step = 1300000
loss = 1.3378378
masked_lm_accuracy = 0.71464056
masked_lm_loss = 1.2572908
next_sentence_accuracy = 0.97375
next_sentence_loss = 0.08065516
[1] MeCab + mecab-ipadic-NEologd + fastText
- MLP (Multi-Layer Perceptron)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.721 | 0.886 | 0.795 | 175 |
| it-life-hack | 0.789 | 0.825 | 0.806 | 154 |
| kaden-channel | 0.856 | 0.856 | 0.856 | 167 |
| livedoor-homme | 0.820 | 0.439 | 0.571 | 114 |
| movie-enter | 0.848 | 0.931 | 0.888 | 174 |
| peachy | 0.801 | 0.701 | 0.748 | 184 |
| smax | 0.930 | 0.930 | 0.930 | 186 |
| sports-watch | 0.980 | 0.890 | 0.932 | 163 |
| topic-news | 0.832 | 0.975 | 0.897 | 157 |
| micro avg | 0.839 | 0.839 | 0.839 | 1474 |
| macro avg | 0.842 | 0.826 | 0.825 | 1474 |
| weighted avg | 0.843 | 0.839 | 0.834 | 1474 |
- CNN
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.935 | 0.909 | 0.922 | 175 |
| it-life-hack | 0.906 | 1.000 | 0.951 | 154 |
| kaden-channel | 1.000 | 0.970 | 0.985 | 167 |
| livedoor-homme | 0.968 | 0.798 | 0.875 | 114 |
| movie-enter | 0.939 | 0.977 | 0.958 | 174 |
| peachy | 0.900 | 0.929 | 0.914 | 184 |
| smax | 0.984 | 0.978 | 0.981 | 186 |
| sports-watch | 0.994 | 1.000 | 0.997 | 163 |
| topic-news | 0.981 | 0.987 | 0.984 | 157 |
| micro avg | 0.955 | 0.955 | 0.955 | 1474 |
| macro avg | 0.956 | 0.950 | 0.952 | 1474 |
| weighted avg | 0.956 | 0.955 | 0.954 | 1474 |
- BiLSTM
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.850 | 0.874 | 0.862 | 175 |
| it-life-hack | 0.851 | 0.929 | 0.888 | 154 |
| kaden-channel | 0.957 | 0.928 | 0.942 | 167 |
| livedoor-homme | 0.786 | 0.675 | 0.726 | 114 |
| movie-enter | 0.860 | 0.989 | 0.920 | 174 |
| peachy | 0.886 | 0.761 | 0.819 | 184 |
| smax | 0.957 | 0.957 | 0.957 | 186 |
| sports-watch | 0.969 | 0.957 | 0.963 | 163 |
| topic-news | 0.950 | 0.975 | 0.962 | 157 |
| micro avg | 0.900 | 0.900 | 0.900 | 1474 |
| macro avg | 0.896 | 0.894 | 0.893 | 1474 |
| weighted avg | 0.900 | 0.900 | 0.899 | 1474 |
[2] SentencePiece
- MLP
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.862 | 0.926 | 0.893 | 175 |
| it-life-hack | 0.911 | 0.929 | 0.920 | 154 |
| kaden-channel | 0.931 | 0.970 | 0.950 | 167 |
| livedoor-homme | 0.886 | 0.684 | 0.772 | 114 |
| movie-enter | 0.945 | 0.983 | 0.963 | 174 |
| peachy | 0.893 | 0.864 | 0.878 | 184 |
| smax | 0.974 | 0.995 | 0.984 | 186 |
| sports-watch | 0.974 | 0.914 | 0.943 | 163 |
| topic-news | 0.909 | 0.955 | 0.932 | 157 |
| micro avg | 0.922 | 0.922 | 0.922 | 1474 |
| macro avg | 0.921 | 0.913 | 0.915 | 1474 |
| weighted avg | 0.922 | 0.922 | 0.920 | 1474 |
- CNN
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.965 | 0.937 | 0.951 | 175 |
| it-life-hack | 0.962 | 0.994 | 0.978 | 154 |
| kaden-channel | 1.000 | 0.982 | 0.991 | 167 |
| livedoor-homme | 0.903 | 0.816 | 0.857 | 114 |
| movie-enter | 0.956 | 0.994 | 0.975 | 174 |
| peachy | 0.937 | 0.962 | 0.949 | 184 |
| smax | 0.989 | 1.000 | 0.995 | 186 |
| sports-watch | 0.975 | 0.975 | 0.975 | 163 |
| topic-news | 0.981 | 0.981 | 0.981 | 157 |
| micro avg | 0.965 | 0.965 | 0.965 | 1474 |
| macro avg | 0.963 | 0.960 | 0.961 | 1474 |
| weighted avg | 0.965 | 0.965 | 0.965 | 1474 |
- BiLSTM
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.927 | 0.943 | 0.935 | 175 |
| it-life-hack | 0.936 | 0.955 | 0.945 | 154 |
| kaden-channel | 0.970 | 0.964 | 0.967 | 167 |
| livedoor-homme | 0.930 | 0.702 | 0.800 | 114 |
| movie-enter | 0.919 | 0.983 | 0.950 | 174 |
| peachy | 0.891 | 0.935 | 0.912 | 184 |
| smax | 0.969 | 0.995 | 0.981 | 186 |
| sports-watch | 0.969 | 0.957 | 0.963 | 163 |
| topic-news | 0.955 | 0.949 | 0.952 | 157 |
| micro avg | 0.940 | 0.940 | 0.940 | 1474 |
| macro avg | 0.941 | 0.931 | 0.934 | 1474 |
| weighted avg | 0.941 | 0.940 | 0.939 | 1474 |
- BERT
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dokujo-tsushin | 0.958 | 0.920 | 0.939 | 175 |
| it-life-hack | 0.933 | 0.987 | 0.959 | 154 |
| kaden-channel | 0.976 | 0.964 | 0.970 | 167 |
| livedoor-homme | 0.922 | 0.825 | 0.870 | 114 |
| movie-enter | 0.944 | 0.977 | 0.960 | 174 |
| peachy | 0.922 | 0.967 | 0.944 | 184 |
| smax | 0.989 | 0.973 | 0.981 | 186 |
| sports-watch | 1.000 | 0.982 | 0.991 | 163 |
| topic-news | 0.969 | 0.987 | 0.978 | 157 |
| accuracy | | | 0.958 | 1474 |
| macro avg | 0.957 | 0.954 | 0.955 | 1474 |
| weighted avg | 0.958 | 0.958 | 0.958 | 1474 |
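For reference, per-class tables like the ones above can be produced with scikit-learn's classification_report; y_true and y_pred below are tiny dummy arrays standing in for the real test-set labels and predictions.

```python
# A minimal sketch of generating a per-class report for the nine
# livedoor categories. The dummy y_true/y_pred just demonstrate the
# call; real values come from a trained classifier's test predictions.
from sklearn.metrics import classification_report

class_names = ["dokujo-tsushin", "it-life-hack", "kaden-channel",
               "livedoor-homme", "movie-enter", "peachy", "smax",
               "sports-watch", "topic-news"]

y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 0]

print(classification_report(y_true, y_pred,
                            labels=list(range(9)),
                            target_names=class_names, digits=3))
```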
The best model among the seven above is the CNN with SentencePiece.
Results may differ for more complicated classification tasks.
For each DNN model tested with both tokenizers (MLP, CNN, and BiLSTM), the model that used SentencePiece outperformed the one that used MeCab + mecab-ipadic-NEologd + fastText.
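For illustration, here is a minimal Keras sketch of a text CNN along the lines of the best-performing model; the filter widths, embedding dimension, and sequence length are assumptions chosen for the example, not hyperparameters taken from this repository.

```python
# A minimal Keras text-CNN sketch: embed token ids, convolve over
# token windows of several widths, max-pool, and classify into the
# nine livedoor categories. All sizes below are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 8000   # SentencePiece vocabulary size used above
MAX_LEN = 256       # assumed maximum sequence length
NUM_CLASSES = 9     # livedoor corpus categories

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)

# One branch per filter width, each followed by global max-pooling.
pooled = []
for width in (3, 4, 5):
    c = layers.Conv1D(64, width, activation="relu")(x)
    pooled.append(layers.GlobalMaxPooling1D()(c))

x = layers.Concatenate()(pooled)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```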