ULMFiT language model for Czech language
simecek/Czech-ULMFiT
The ULMFiT paper appeared in January 2018 and pioneered transfer learning for NLP data. ULMFiT runs in three steps: first, train a language model; then fine-tune it to a specific task; and finally, use the fine-tuned model for the final prediction. The method is described in the following paper and implemented in the fastai package.
Slavic and other morphologically rich languages need special preprocessing (sentencepiece instead of spaCy), as explained in the following paper for Polish.
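To illustrate why subword tokenization helps with rich morphology, here is a toy sketch. It is not how sentencepiece works internally: sentencepiece learns its vocabulary from data, whereas this example uses a hand-made, hypothetical vocabulary (`SUBWORDS`) and a simple greedy longest-match segmenter (`segment`, also a made-up helper). The point is only that many inflected word forms can be covered by a small set of shared pieces.

```python
# Toy illustration only: greedy longest-match segmentation over a hand-made
# subword vocabulary. This is NOT the sentencepiece algorithm, and the
# vocabulary below is hypothetical.
SUBWORDS = {"film", "hrad", "u", "em", "y", "ů", "ech"}


def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Inflected forms of two Czech nouns, "film" (film) and "hrad" (castle).
forms = [
    "film", "filmu", "filmem", "filmy", "filmů", "filmech",
    "hrad", "hradu", "hradem", "hrady", "hradů", "hradech",
]

segmented = {word: segment(word, SUBWORDS) for word in forms}
distinct_pieces = {piece for pieces in segmented.values() for piece in pieces}

for word, pieces in segmented.items():
    print(word, "->", pieces)
print(len(forms), "word forms,", len(distinct_pieces), "distinct subword pieces")
```

A word-level tokenizer would need one vocabulary entry per inflected form, while the stems and endings here are shared across words.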
I have trained ULMFiT on Czech Wikipedia as a hobby project. To my knowledge, this is the first ULMFiT model for the Czech language.
Notebook(s): nn-czech.ipynb
Weights: cs_wt.pth, cs_wt_vocab.pkl, spm.model, spm.vocab
With a Tesla P4 GPU on the Google Cloud virtual machine specified here, the training took ~28 hours. I closely followed the recent ULMFiT lecture from the fast.ai NLP course.
The experiments are still a work in progress (help needed! do you know any good Czech sentiment benchmark?). I have found a couple of datasets in the following paper:
Data: http://liks.fav.zcu.cz/sentiment/ (Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License)
As a proof of concept, I have performed sentiment classification of ~60K Czech movie reviews:
- CSFD movie dataset: 91,381 movie reviews (30,897 positive, 30,768 neutral, and 29,716 negative reviews) from the Czech Movie Database. In this first experiment, I omitted the neutral reviews and built a classifier of positive vs. negative reviews only (90% used for training, 10% for validation). The achieved accuracy was 94.5%.
Notebook: nn-czech.ipynb (same as for language model training)
Colab: CSFD_retrained_colab.ipynb This demonstrates how to fine-tune the language model for classification (here, the sentiment of movie reviews). The final sentiment classifier with 94.5% accuracy can be downloaded as cs_csfd_2classes_945.pkl. The training was done on Colab Pro with a Tesla P100-PCIE-16GB GPU.
Demo: CSFD_demo.ipynb For users just interested in sentiment analysis, this is a no-fuss demo of how to set up the environment, load the model, and get a sentiment prediction for a given text.
Web app: I reshaped the demo script into a simple web app; the code lives in the detektor_slunickovosti repo (in Czech).
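The data preparation for the CSFD experiment above (dropping neutral reviews, then a 90/10 split) can be sketched in plain Python. This is only an illustration, not the code from the notebook: the helper `binary_split`, the seed, the unstratified random shuffle, and the placeholder review texts are all assumptions. Only the class counts come from the dataset description.

```python
import random


def binary_split(reviews, valid_frac=0.1, seed=42):
    """Drop neutral reviews, shuffle, and split into (train, valid) lists.

    `reviews` is a list of (text, label) pairs. The seed and the plain
    random shuffle are assumptions made for this sketch.
    """
    kept = [r for r in reviews if r[1] != "neutral"]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_valid = round(len(kept) * valid_frac)
    return kept[n_valid:], kept[:n_valid]


# Placeholder records using the class counts reported for the CSFD dataset.
counts = {"positive": 30897, "neutral": 30768, "negative": 29716}
reviews = [(f"review {i}", label) for label, n in counts.items() for i in range(n)]

train, valid = binary_split(reviews)
print(len(reviews), len(train), len(valid))  # 91381 54552 6061
```

With the neutral class removed, 60,613 reviews remain, which matches the "~60K" figure quoted above.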
This repo is a little dwarf standing on the shoulders of giants. Let me thank at least a few of them:
Jeremy Howard, Rachel Thomas, and the whole fast.ai team for developing ULMFiT and for making the addition of new languages super simple with the latest fastai version. Also, Piotr Czapla for the subword tokenization idea and the Polish ULMFiT model.
Karla Fejfarova for introducing me to ULMFiT a year ago. Katerina Veselovska for the motivation I got from her recent NLP talk at the ML meetup in Brno.
Google for free Google Cloud credits.