A tutorial implementation of T9 predictive input (without spaces) with Juman++
eiennohito/jumanpp-t9
A tutorial project for using Juman++ as a general text processing tool.
This tutorial implements something like T9 with a small variation: what would happen if you input everything without spaces?
- Unix-like environment
- C++14-compatible compiler
- CMake 3.1 or later
- Python 3 or later
git clone --recursive https://github.com/eiennohito/jumanpp-t9.git
cd jumanpp-t9
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
You should get a src/jumanpp_t9 binary.
Let's try to use Juman++ to solve T9 without spaces. The goal of this tutorial is to demonstrate how to make morphological analyzers using Juman++.
T9 was popular before smartphones arrived and everyone started to use QWERTY on their touch screens. Dumbphones, however, did not have a touchscreen. Instead they had only a 4x3 keypad like this one (picture credit to Wikipedia).
T9 had a great idea: what if we press the key for each letter only once and select matching words by using the context? For example, to input "hello" you would type "43556". 0 was used for spaces and 1 for punctuation.
But if you provide spaces, it is relatively easy to guess the correct word. But how would we do if there were no spaces? For example, the previous sentence would be input as: 28846996853933643843739373667722371. In this case, we would need to segment the digits into "words" and, at the same time, select the words corresponding to those digits.
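The digit encoding itself is easy to reproduce. The following is a minimal sketch, assuming the standard 12-key letter layout; the function name `t9_encode` and the `keep_spaces` switch are illustrative and not part of this repository. Mapping every non-letter, non-space character to 1 is a simplification.

```python
# Standard 12-key letter layout (assumed); 0 is space, 1 is punctuation.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in KEYPAD.items() for ch in letters}


def t9_encode(text, keep_spaces=False):
    """Convert text to T9 digits; spaces are dropped unless keep_spaces is True."""
    digits = []
    for ch in text.lower():
        if ch in LETTER_TO_DIGIT:
            digits.append(LETTER_TO_DIGIT[ch])
        elif ch.isspace():
            if keep_spaces:
                digits.append("0")
        else:
            # Simplification: anything else is treated as punctuation.
            digits.append("1")
    return "".join(digits)


print(t9_encode("hello"))
print(t9_encode("But how would we do if there were no spaces."))
```

The second call reproduces the digit string quoted above.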
This is what morphological analyzers for languages with continuous scripts (like Japanese or Chinese) have to do. They segment continuous text into tokens (morphemes) and tag them with additional information like lemmas (base forms) and parts of speech.
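To see why segmentation and word selection have to be solved together, here is a small sketch that enumerates every reading of a digit string against a toy dictionary. The dictionary contents and the function name `segmentations` are made up for illustration. A real analyzer like Juman++ does not enumerate paths this way; it scores them and searches for the best one.

```python
# Toy dictionary (hypothetical sample): digit string -> candidate words.
TOY_DIC = {
    "8447": ["this"],
    "84": ["vi"],
    "47": ["is"],
    "66335": ["model"],
    "96757": ["works"],
    "9675": ["work", "york"],
    "7": ["s"],
}


def segmentations(digits, dic):
    """Yield every split of `digits` into dictionary entries as (digits, word) lists."""
    if not digits:
        yield []
        return
    for end in range(1, len(digits) + 1):
        prefix = digits[:end]
        for word in dic.get(prefix, ()):
            for rest in segmentations(digits[end:], dic):
                yield [(prefix, word)] + rest


for path in segmentations("84476633596757", TOY_DIC):
    print(" ".join(word for _, word in path))
```

Even with seven dictionary keys, "this model works" competes with readings such as "vi is model york s", so the analyzer needs a scoring model to choose between them.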
Juman++ is a modern morphological analyzer for languages with continuous scripts, and we will try to use it to solve T9 without spaces.
To build an analyzer using Juman++ we need to prepare a few things, namely:
- Dictionary,
- Analysis Spec,
- Driver Program.
The analysis dictionary defines the words which our analyzer will be able to accept. Words can define other fields, like parts of speech. The dictionary should be supplied in CSV format.
Let's create a dictionary that would have at least
- 9-pad number representation
- English original spelling
Let's use the Universal Dependencies project for creating the dictionary. It provides annotated text corpora in many languages, and we will use the annotation information later for improving our model.
Let's download the UD corpora.
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2515/ud-treebanks-v2.1.tgz
tar xf ud-treebanks-v2.1.tgz
Inside there will be a lot of folders starting with UD_. We will use UD_English. It should have these files:
LICENSE.txt
README.md
en-ud-dev.conllu
en-ud-dev.txt
en-ud-test.conllu
en-ud-test.txt
en-ud-train.conllu
en-ud-train.txt
stats.xml
Let's first make a training corpus using the conversion script. From the jumanpp-t9 root directory execute:
python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-train.conllu > build/ud_en.train.corpus
The file will contain lines like
3766,from
843,the
27,ap
26637,comes
8447,this
78679,story
1,:
EOS
Each line will correspond to a single word in the sentence; EOS marks the end of a sentence. Let's repeat the conversion for the dev and test parts as well:
python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-dev.conllu > build/ud_en.dev.corpus
python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-test.conllu > build/ud_en.test.corpus
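The corpus format described above (one digits,word pair per line, with EOS between sentences) is simple enough to read back in a few lines. This is a minimal sketch, assuming the files contain exactly the layout shown in the sample and no CSV quoting; `read_minicorpus` is a hypothetical helper, not a script from this repository.

```python
def read_minicorpus(lines):
    """Group 'digits,word' lines into sentences, using EOS lines as separators."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "EOS":
            yield sentence
            sentence = []
        elif line:
            # Split on the first comma only, so a word that is itself "," survives.
            digits, word = line.split(",", 1)
            sentence.append((digits, word))
    if sentence:
        # Tolerate a final sentence without a trailing EOS.
        yield sentence


sample = ["3766,from\n", "843,the\n", "27,ap\n", "EOS\n", "8447,this\n", "1,:\n", "EOS\n"]
for sent in read_minicorpus(sample):
    print(sent)
```

With a real file, the same function can be fed an open file object, for example `read_minicorpus(open("build/ud_en.dev.corpus", encoding="utf-8"))`.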
And now let's compile the dictionary itself using another Python script.
python3 scripts/minicorpus2minidic.py build/ud_en.*.corpus > build/ud_en.dic
It will contain lines like:
227623,abroad
22787859,abruptly
2273623,absence
227368,absent
227368464,absenting
22765883,absolute
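Conceptually, going from corpora to a dictionary means collecting the unique digits,word pairs. The sketch below shows that idea only; it assumes the corpus layout shown earlier and sorts by word to match the sample output. The actual minicorpus2minidic.py script may do more or work differently, and `build_dictionary` is a hypothetical name.

```python
def build_dictionary(corpus_lines):
    """Collect unique (digits, word) pairs from corpus lines, sorted by word."""
    entries = set()
    for raw in corpus_lines:
        line = raw.strip()
        if not line or line == "EOS":
            continue  # sentence separators and blank lines carry no entry
        digits, word = line.split(",", 1)
        entries.add((digits, word))
    return sorted(entries, key=lambda entry: (entry[1], entry[0]))


corpus = ["8447,this", "843,the", "EOS", "843,the", "27,ap", "EOS"]
for digits, word in build_dictionary(corpus):
    print(f"{digits},{word}")
```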
The Analysis Spec defines the dictionary format, the features used for scoring, and the structure of the loss function. Read the full documentation.
A minimal dictionary spec would be something like
field 1 surface string trie_index
field 2 english string
ngram [surface, english]
train loss surface 1, english 1
The first two lines define the dictionary format. The actual CSV can contain more fields, but this spec uses only the first two. Column #1 is called "surface" and has string type; a trie-based index for surface lookup is built over this field. Column #2 is called "english" and has string type.
A single unigram feature, which uses the contents of both the surface and english fields, is used for path scoring. Finally, the loss uses both fields with an equal weight of 1.
The creation of a trained model is done in two steps.
- We build a binary dictionary from a spec and dictionary csv.
- We train model weights using a binary dictionary and corpus, producing an analysis model.
Let's do this!
cd build
./jumanpp/src/core/tool/jumanpp_tool index \
    --spec ../src/jumanpp_t9_nano.spec \
    --dict-file ud_en.dic \
    --output nano.seed
../scripts/train.sh ./jumanpp/src/core/tool/jumanpp_tool nano.seed ud_en.dev.corpus dev.nano.model
You can also try to train a model with ud_en.train.corpus, but the training will use around 6GB of RAM.
Finally, we need a program which will do the actual analysis. While training and dictionary preparation can be done with generalized tools, Juman++ does not provide a generalized analysis tool for two reasons: output formats and statically generated feature processing code.
I have already implemented a very simple driver program for our case.
Let's test our trained model:
# There is a tester.py script which converts English text to digits
# and forwards them to the driver program itself.
python3 ../scripts/tester.py ./src/jumanpp_t9 dev.nano.model
Ok, let's test it!
this model works
84476633596757
8447 this
66335 model
96757 works
and sometimes it does not
263766384637483637668
263 and
7663 roof
84 vi
63748 merit
3637 does
668 not
So, our model already works for some inputs, but not for everything. Let's improve on it.
The current model uses only a single unigram feature template, which is obviously very naive. Let's add some other templates so the ngram feature section will look like:
ngram [surface]
ngram [english]
ngram [surface, english]
ngram [surface][surface]
ngram [english][english]
ngram [surface, english][surface, english]
ngram [english][english][english]
Now we have three unigram, three bigram and one trigram feature templates.
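To make the role of these templates concrete, here is a conceptual sketch of how unigram, bigram and trigram templates can turn a path of dictionary entries into features for a linear scoring model. This is not how Juman++ implements feature computation internally. The `<BOS>` padding, the feature representation, and the names `extract_features` and `score` are all assumptions made for illustration.

```python
# Each template lists, per n-gram position, the fields that are combined there.
TEMPLATES = [
    [("surface",)],
    [("english",)],
    [("surface", "english")],
    [("surface",), ("surface",)],
    [("english",), ("english",)],
    [("surface", "english"), ("surface", "english")],
    [("english",), ("english",), ("english",)],
]

# Hypothetical padding node so that n-grams are defined at the sentence start.
BOS = {"surface": "<BOS>", "english": "<BOS>"}


def extract_features(path, templates=TEMPLATES):
    """Return (template index, field values) features for every node on the path."""
    padded = [BOS, BOS] + list(path)  # two pads are enough for trigrams
    features = []
    for i in range(2, len(padded)):
        for t_idx, template in enumerate(templates):
            window = padded[i - len(template) + 1 : i + 1]
            values = tuple(
                tuple(node[field] for field in fields)
                for fields, node in zip(template, window)
            )
            features.append((t_idx, values))
    return features


def score(path, weights):
    """Linear model: sum the weights of all features that fire on the path."""
    return sum(weights.get(feature, 0.0) for feature in extract_features(path))


path = [
    {"surface": "8447", "english": "this"},
    {"surface": "66335", "english": "model"},
]
weights = {(4, (("this",), ("model",))): 1.5, (1, (("model",),)): 0.5}
print(score(path, weights))
```

The bigram and trigram templates let the model reward word sequences rather than isolated words, which a single unigram template cannot express.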
Let's retrain our model with the new spec:
./jumanpp/src/core/tool/jumanpp_tool index \
    --spec ../src/jumanpp_t9_mini.spec \
    --dict-file ud_en.dic \
    --output mini.seed
../scripts/train.sh ./jumanpp/src/core/tool/jumanpp_tool mini.seed ud_en.dev.corpus dev.mini.model
Notice that the loss became much smaller than with the "nano" model. "And sometimes it does not" should be restored correctly this time.
By default, Juman++ uses virtual function-based dynamic dispatch for evaluating the feature-based model score. The indirection caused by virtual function calls has a non-negligible performance penalty, especially for complex models.
For analysis we usually want all the speed we can get. For that, Juman++ can generate static C++ code based on the Analysis Spec.
Ok, let's try (from the CMake build folder):
./jumanpp/src/core/tool/jumanpp_tool static-features \
    --spec ../src/jumanpp_t9_mini.spec \
    --class-name JppT9Mini \
    --output codegen/t9_mini.cg
There should be two files, t9_mini.cg.h and t9_mini.cg.cc, in the codegen subfolder.
Now you need to inject the generated code into the driver program. jumanpp_t9.cc should contain this line in the main function:
dieOnError(env.initFeatures(nullptr));
The nullptr parameter is a pointer to a StaticFeatureFactory, which is responsible for creating the feature processing functionality used for inference. Passing a pointer to an actual instance from the generated code instead enables faster analysis.
A complete example is available at jumanpp_t9_static.cc for the implementation and CMakeLists.txt for how to integrate everything into the build system.
First you need to include the JumanppStaticFeatures.cmake file from the Juman++ repository. That gives you a jumanpp_gen_static function which does the dirty work of invoking jumanpp_tool. The jumanpp_gen_static function has 4 arguments:
- Analysis spec file
- Static feature factory class name
- A variable name which will be filled with the directory name where the C++ code will be generated. You will need to add that to include_directories of your driver binary.
- A variable name which will be filled with the path to the generated source files. You will need to add that to the driver target source files.