eiennohito/jumanpp-t9

A tutorial implementation of T9 predictive input (without spaces) with Juman++

License

Apache-2.0 license


What is this about

A tutorial project for using Juman++ as a general text processing tool.

This tutorial implements something like T9 with a small variation: what happens if you input everything without spaces?

Prerequisites

Unix-like environment
C++14-compatible compiler
CMake 3.1 or later
Python 3 or later

How to build

git clone --recursive https://github.com/eiennohito/jumanpp-t9.git
cd jumanpp-t9
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

You should get a src/jumanpp_t9 binary.

Tutorial

Let's try to use Juman++ to solve T9 without spaces. The goal of this tutorial is to demonstrate how to build morphological analyzers with Juman++.

Introduction: What was T9

T9 was popular before smartphones arrived and everyone started using QWERTY keyboards on their touchscreens. Dumbphones, however, did not have a touchscreen. Instead, they only had a 4x3 keypad, with three or four letters printed on each of the keys 2 to 9.

[Picture of a 4x3 phone keypad; credit: Wikipedia]

T9 had a great idea: what if we press the key for each letter only once and select the matching words using the context? For example, to input "hello" you would type "43556". 0 was used for spaces and 1 for punctuation.
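The key mapping can be sketched in a few lines of Python. This is only an illustration, not the code used by the scripts in this repository; the to_t9 name and the keep_spaces parameter are made up for this example, and treating every non-letter, non-space character as punctuation is an assumption.

```python
# Letters printed on each key of a classic 4x3 phone keypad.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_t9(text, keep_spaces=True):
    """Convert text to key presses: letters -> 2-9, space -> 0, anything else -> 1."""
    digits = []
    for ch in text.lower():
        if ch in LETTER_TO_DIGIT:
            digits.append(LETTER_TO_DIGIT[ch])
        elif ch.isspace():
            if keep_spaces:
                digits.append("0")
        else:
            # Assumption: every other character is treated as punctuation.
            digits.append("1")
    return "".join(digits)


print(to_t9("hello"))                                # 43556
print(to_t9("this model works", keep_spaces=False))  # 84476633596757
```

Passing keep_spaces=False gives the "without spaces" variation this tutorial is about.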

If you provide spaces, it is relatively easy to guess the correct word. But how would we do if there were no spaces. For example, the previous sentence (including its final period) would be entered as: 28846996853933643843739373667722371. In this case, we need to segment the digits into "words" and select the words corresponding to those digits simultaneously.

This is what morphological analyzers for languages with continuous scripts (like Japanese or Chinese) have to do. They segment continuous text into tokens (morphemes) and tag them with additional information like lemmas (base forms) and parts of speech.

Juman++ is a modern morphological analyzer for languages with continuous scripts. We will try to use it to solve T9 without spaces.
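To get a feeling for why segmentation and word selection have to be solved together, here is a brute-force sketch that lists every way to read a digit string with a tiny dictionary. The dictionary contents and the segmentations function are hypothetical and exist only for this example; a real analyzer scores the candidates with a trained model instead of listing them all.

```python
from functools import lru_cache

# Tiny hypothetical dictionary: digit string -> candidate words.
DICTIONARY = {
    "46": ["go", "in"],
    "4663": ["good", "home"],
    "63": ["me"],
}


def segmentations(digits):
    """Enumerate every way to split `digits` into dictionary words."""

    @lru_cache(maxsize=None)
    def from_pos(start):
        if start == len(digits):
            return [[]]  # one way to analyze the empty remainder
        results = []
        for end in range(start + 1, len(digits) + 1):
            for word in DICTIONARY.get(digits[start:end], []):
                for rest in from_pos(end):
                    results.append([word] + rest)
        return results

    return from_pos(0)


for words in segmentations("464663"):
    print(" ".join(words))
```

Even with three dictionary entries, the six digits 464663 have 8 possible readings ("go home", "in good", "go in me", and so on). Choosing between them is the job of the features and the trained weights described below.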

Juman++ Structure

To build an analyzer using Juman++ we need to prepare three things:

Dictionary,
Analysis Spec,
Driver Program.

Dictionary

The analysis dictionary defines the words which our analyzer will be able to accept. Words can carry other fields, like parts of speech. The dictionary should be supplied in CSV format.

Let's create a dictionary that would have at least

9-pad number representation
English original spelling

Let's use the Universal Dependencies project for creating the dictionary. It provides annotated text corpora in many languages, and we will use the annotation information later for improving our model.

Let's download the UD corpora.

wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2515/ud-treebanks-v2.1.tgz
tar xf ud-treebanks-v2.1.tgz

Inside there will be a lot of folders starting with UD_. We will use UD_English. It should have these files:

LICENSE.txt
README.md
en-ud-dev.conllu
en-ud-dev.txt
en-ud-test.conllu
en-ud-test.txt
en-ud-train.conllu
en-ud-train.txt
stats.xml

Let's first make a training corpus using the conversion script. From the jumanpp-t9 root directory execute:

python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-train.conllu > build/ud_en.train.corpus

The file will contain lines like

3766,from
843,the
27,ap
26637,comes
8447,this
78679,story
1,:
EOS

Each line corresponds to a single word in the sentence; EOS marks the end of a sentence. Let's repeat the conversion for the dev and test parts as well:

python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-dev.conllu > build/ud_en.dev.corpus
python3 scripts/conll2minicorpus.py <path to UD dir>/UD_English/en-ud-test.conllu > build/ud_en.test.corpus
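If you want to inspect the generated corpus files, the format described above is easy to read back. The following is a minimal sketch; read_minicorpus is a made-up helper and not part of this repository, and it assumes that every non-EOS line has the form digits,word.

```python
def read_minicorpus(lines):
    """Group `digits,word` lines into sentences; an `EOS` line closes a sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        # Split on the first comma only, so that punctuation after it is kept.
        digits, _, word = line.partition(",")
        current.append((digits, word))
    if current:  # tolerate a missing final EOS
        sentences.append(current)
    return sentences


sample = """3766,from
843,the
27,ap
26637,comes
8447,this
78679,story
1,:
EOS
""".splitlines()

for sentence in read_minicorpus(sample):
    print(" ".join(word for _, word in sentence))
```

For a real file you would pass an open file object, for example read_minicorpus(open("build/ud_en.dev.corpus", encoding="utf-8")).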

And now let's compile the dictionary itself using another Python script.

python3 scripts/minicorpus2minidic.py build/ud_en.*.corpus > build/ud_en.dic

It will contain lines like:

227623,abroad
22787859,abruptly
2273623,absence
227368,absent
227368464,absenting
22765883,absolute
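Conceptually, the dictionary is just the set of distinct entries seen in the corpora. The sketch below shows that idea; it is not the actual minicorpus2minidic.py, the build_dictionary name is made up, and sorting by word is an assumption based on the sample above. No CSV quoting is attempted.

```python
def build_dictionary(corpus_lines):
    """Collect unique (digits, word) pairs and format them as `digits,word` lines."""
    entries = set()
    for raw in corpus_lines:
        line = raw.strip()
        if not line or line == "EOS":
            continue
        digits, _, word = line.partition(",")
        entries.add((word, digits))
    # Assumption: entries are ordered by word, as in the sample output.
    return [f"{digits},{word}" for word, digits in sorted(entries)]


corpus = ["227368,absent", "227623,abroad", "EOS", "227368,absent", "EOS"]
for entry in build_dictionary(corpus):
    print(entry)
```

Duplicates collapse into a single entry, so the dictionary stays much smaller than the corpora it was built from.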

Analysis Spec

The analysis spec defines the dictionary format, the features used for scoring, and the structure of the loss function. See the full documentation for details.

A minimal dictionary spec would be something like

field 1 surface string trie_index
field 2 english string
ngram [surface, english]
train loss surface 1, english 1

The first two lines define the dictionary format. This spec uses only two fields; the actual CSV can contain more, but only the first two will be used. Column #1 is called "surface" and has the string type, and a trie-based index for surface lookup is built over this field. Column #2 is called "english" and has the string type.

A single unigram feature, which combines the contents of the surface and english fields, is used for path scoring. Finally, the loss uses both fields with an equal weight of 1.
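To see why a trie-based index over the surface field is useful, consider what the analyzer has to do at every input position: find all dictionary entries whose surface starts there. The sketch below shows this lookup with a plain Python trie. It is only an illustration of the idea, not how Juman++ implements its index; DigitTrie and prefixes_from are hypothetical names.

```python
class DigitTrie:
    """A minimal trie keyed by digit characters."""

    def __init__(self):
        self.children = {}
        self.words = []

    def insert(self, digits, word):
        node = self
        for d in digits:
            node = node.children.setdefault(d, DigitTrie())
        node.words.append(word)

    def prefixes_from(self, text, start):
        """Yield (end, word) for every entry whose surface equals text[start:end]."""
        node = self
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                return  # no entry continues with this digit
            for word in node.words:
                yield end + 1, word


trie = DigitTrie()
for digits, word in [("8447", "this"), ("84", "vi"), ("66335", "model")]:
    trie.insert(digits, word)

print(list(trie.prefixes_from("844766335", 0)))  # [(2, 'vi'), (4, 'this')]
print(list(trie.prefixes_from("844766335", 4)))  # [(9, 'model')]
```

A single walk down the trie finds all entries starting at a position, regardless of how many words the dictionary contains.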

The creation of a trained model is done in two steps.

We build a binary dictionary from a spec and a dictionary CSV.
We train model weights using the binary dictionary and a corpus, producing an analysis model.

Let's do this!

cd build
./jumanpp/src/core/tool/jumanpp_tool index \
    --spec ../src/jumanpp_t9_nano.spec \
    --dict-file ud_en.dic \
    --output nano.seed
../scripts/train.sh ./jumanpp/src/core/tool/jumanpp_tool nano.seed ud_en.dev.corpus dev.nano.model

You can also try to train a model with ud_en.train.corpus, but the training will use around 6GB of RAM.

Driver Program

Finally, we need a program which will do the actual analysis. While training and dictionary preparation can be done with general-purpose tools, Juman++ does not provide a general-purpose analysis tool for two reasons: output formats and statically generated feature processing code.

I have already implemented a very simple driver program for our case.

Let's test our trained model:

# There is a tester.py script which converts English text to digits
# and forwards them to the driver program itself.
python3 ../scripts/tester.py ./src/jumanpp_t9 dev.nano.model

Ok, let's test it!

this model works
84476633596757
8447this66335model96757works
and sometimes it does not
263766384637483637668
263and7663roof84vi63748merit3637does668not

So, our model already works for some inputs, but not for everything. Let's improve on it.

Improving the spec

The current model uses only a single unigram feature template, which is obviously very naive. Let's add some other templates so the ngram feature section will look like:

ngram [surface]
ngram [english]
ngram [surface, english]
ngram [surface][surface]
ngram [english][english]
ngram [surface, english][surface, english]
ngram [english][english][english]

Now we have three unigram, three bigram and one trigram feature templates.
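As a rough mental model, each template is instantiated at every node of a candidate path, and the score of the path is the sum of the weights of the resulting features. The sketch below illustrates this for the seven templates. It is a conceptual illustration only; the data layout, the BOS padding node and the function names are all made up, and the real feature computation in Juman++ differs.

```python
# Each template is a list of positions; each position lists the fields to read.
# Position 0 is the current node, 1 the previous node, 2 the one before that.
# Every position of a template reads the same fields here, so the order of the
# positions does not matter for this illustration.
TEMPLATES = [
    [["surface"]],
    [["english"]],
    [["surface", "english"]],
    [["surface"], ["surface"]],
    [["english"], ["english"]],
    [["surface", "english"], ["surface", "english"]],
    [["english"], ["english"], ["english"]],
]

# Hypothetical padding node used before the first word of a sentence.
BOS = {"surface": "<BOS>", "english": "<BOS>"}


def features_at(path, i):
    """Instantiate every template at position i of a path of nodes."""
    feats = []
    for template_id, template in enumerate(TEMPLATES):
        parts = []
        for offset, fields in enumerate(template):
            j = i - offset
            node = path[j] if j >= 0 else BOS
            parts.append("|".join(node[f] for f in fields))
        feats.append((template_id, tuple(parts)))
    return feats


def score(path, weights):
    """Linear model: sum the weights of all features found along the path."""
    return sum(
        weights.get(feat, 0.0)
        for i in range(len(path))
        for feat in features_at(path, i)
    )


path = [
    {"surface": "8447", "english": "this"},
    {"surface": "66335", "english": "model"},
]
for feat in features_at(path, 1):
    print(feat)
```

The bigram and trigram templates are what let the model prefer "does not" over an unlikely sequence that happens to match the same digits.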

Let's retrain our model with the new spec:

./jumanpp/src/core/tool/jumanpp_tool index \
    --spec ../src/jumanpp_t9_mini.spec \
    --dict-file ud_en.dic \
    --output mini.seed
../scripts/train.sh ./jumanpp/src/core/tool/jumanpp_tool mini.seed ud_en.dev.corpus dev.mini.model

Notice that the loss became much smaller than with the "nano" model. "And sometimes it does not" should be restored correctly this time.

Advanced Features

Using Code Generation for Linear Model Inference

By default, Juman++ uses dynamic dispatch based on virtual functions for evaluating the feature-based model score. The indirection caused by virtual function calls has a non-negligible performance penalty, especially for complex models.

For analysis we usually want all the speed we can get. For that, Juman++ can generate static C++ code based on the analysis spec.

Ok, let's try (from the CMake build folder):

./jumanpp/src/core/tool/jumanpp_tool static-features \
    --spec ../src/jumanpp_t9_mini.spec \
    --class-name JppT9Mini \
    --output codegen/t9_mini.cg

There should be two files, t9_mini.cg.h and t9_mini.cg.cc, in the codegen subfolder.

Now you need to inject the generated code into the driver program. jumanpp_t9.cc should contain this line in the main function.

dieOnError(env.initFeatures(nullptr));

The nullptr parameter is a pointer to a StaticFeatureFactory, which is responsible for creating the feature processing functionality for inference. Passing a pointer to an actual instance from the generated code there will enable the faster analysis.

A complete example is available at jumanpp_t9_static.cc for the implementation and CMakeLists.txt for how to integrate everything into the build system.

First you need to include the JumanppStaticFeatures.cmake file from the Juman++ repository. That gives you a jumanpp_gen_static function which does the dirty work of invoking jumanpp_tool. The jumanpp_gen_static function has 4 arguments:

Analysis spec file
Static feature factory class name
A variable name which will be filled with the name of the directory where the C++ code will be generated. You will need to add that to include_directories of your driver binary.
A variable name which will be filled with the paths to the generated source files. You will need to add that to the driver target source files.
