mthebaud/predict4all

PREDICT4ALL

Predict4All is an accurate, fast, lightweight, multilingual, free and open-source next word prediction library.

It is meant to be integrated into applications to display possible next words and help user input: virtual keyboards, text editors, AAC systems...

Key features

Next word prediction
Current word completion
Live, accurate and customizable word correction while typing
Dynamic models: automatically learn new words and sentences to integrate the user's language and style
Lightweight prediction and training: few dependencies and fully integrated algorithms
Easy integration: load precomputed models and start predicting!
Low memory footprint: dynamically loaded language models save memory - it only uses 25 MB of heap space!

Predict4All's originality lies in its correction model: it relies on a set of correction rules (general or specific: accents, grammar, missing space, etc). This correction model allows correction to happen earlier in the prediction than with string distance techniques. It also allows the correction to behave like existing correctors (e.g. GBoard) while being enhanced with custom rules based on user errors (dysorthography, dyslexia, etc).

Predict4All was co-designed with speech therapists and occupational therapists from CMRRF Kerpape and Hopital Raymond Poincaré to ensure that it matches the needs and requirements of users with speech and text writing troubles. Particular attention was given to determining the common and specific mistakes made by people with dysorthography and dyslexia.

Currently, Predict4All supports the French language (provided rules and pre-trained language model).

Project

This library was developed in the collaborative project Predict4All involving CMRRF Kerpape, Hopital Raymond Poincaré and BdTln Team, LIFAT, Université de Tours.

Predict4All is supported by Fondation Paul Bennetot, Fondation du Groupe Matmut under Fondation de l'Avenir, Paris, France (project AP-FPB 16-001).

This project has been integrated in the following AAC software: LifeCompanion, Sibylle, CiviKey.

The project is still being developed as part of the AAC4ALL project.

Usage

Installation

Train (see "Training your own language model") or download a language model. A pre-computed French language model is available:

fr_ngrams.bin: contains pre-trained word sequences
fr_words.bin: contains the associated vocabulary

The French language model was trained on more than 20 million words from Wikipedia and subtitle corpora. The vocabulary contains ~112,000 unique words.

Get the library through your favorite dependency manager :

Maven

    <dependency>
        <groupId>io.github.mthebaud</groupId>
        <artifactId>predict4all</artifactId>
        <version>1.2.0</version>
    </dependency>

Gradle

    implementation 'io.github.mthebaud:predict4all:1.2.0'

The following examples assume that you have already initialized the language model, predictor, etc.

    // Language model files (pre-trained or produced by your own training)
    final File FILE_NGRAMS = new File("fr_ngrams.bin");
    final File FILE_WORDS = new File("fr_words.bin");

    LanguageModel languageModel = new FrenchLanguageModel();
    PredictionParameter predictionParameter = new PredictionParameter(languageModel);

    WordDictionary dictionary = WordDictionary.loadDictionary(languageModel, FILE_WORDS);
    // The ngram dictionary keeps its file open: close it when you are done
    try (StaticNGramTrieDictionary ngramDictionary = StaticNGramTrieDictionary.open(FILE_NGRAMS)) {
        WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary);
        // EXAMPLE CODE SHOULD RUN HERE
    }

You can find complete working code for these examples and more complex ones in predict4all-example.

Please read the Javadoc (public classes are well documented).

Predict next words

    WordPredictionResult predictionResult = wordPredictor.predict("j'aime manger des ");
    for (WordPrediction prediction : predictionResult.getPredictions()) {
        System.out.println(prediction);
    }

Result (French language model)

    trucs = 0.16105959785338766 (insert = trucs, remove = 0, space = true)
    fruits = 0.16093509126844632 (insert = fruits, remove = 0, space = true)
    bonbons = 0.11072838908013616 (insert = bonbons, remove = 0, space = true)
    gâteaux = 0.1107102433239866 (insert = gâteaux, remove = 0, space = true)
    frites = 0.1107077522148962 (insert = frites, remove = 0, space = true)

Predict word endings

    WordPredictionResult predictionResult = wordPredictor.predict("je te r");
    for (WordPrediction prediction : predictionResult.getPredictions()) {
        System.out.println(prediction);
    }

Result (French language model)

    rappelle = 0.25714609184509885 (insert = appelle, remove = 0, space = true)
    remercie = 0.12539880967030353 (insert = emercie, remove = 0, space = true)
    ramène = 0.09357117922321868 (insert = amène, remove = 0, space = true)
    retrouve = 0.07317575867400958 (insert = etrouve, remove = 0, space = true)
    rejoins = 0.06404375655722373 (insert = ejoins, remove = 0, space = true)

To tune the WordPredictor, you can explore the PredictionParameter javadoc.

Using correction rules

    // Build a rule tree containing the default French accent rules
    CorrectionRuleNode root = new CorrectionRuleNode(CorrectionRuleNodeType.NODE);
    root.addChild(FrenchDefaultCorrectionRuleGenerator.CorrectionRuleType.ACCENTS.generateNodeFor(predictionParameter));
    predictionParameter.setCorrectionRulesRoot(root);
    predictionParameter.setEnableWordCorrection(true);

    WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary);
    WordPredictionResult predictionResult = wordPredictor.predict("il eta");
    for (WordPrediction prediction : predictionResult.getPredictions()) {
        System.out.println(prediction);
    }

Result (French language model)

    était = 0.9485814446960688 (insert = était, remove = 3, space = true)
    établit = 0.05138460933797299 (insert = établit, remove = 3, space = true)
    étale = 7.544080911878824E-6 (insert = étale, remove = 3, space = true)
    établissait = 4.03283914323952E-6 (insert = établissait, remove = 3, space = true)
    étaye = 4.025324786425216E-6 (insert = étaye, remove = 3, space = true)

In this example, remove becomes positive because the first letter of the typed word is incorrect: the previously typed text has to be removed before the insert.
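The insert, remove and space values printed in the results above suggest how a host application could apply a chosen prediction to its text field. The following is a minimal sketch under that reading of the printed output: PredictionApplier and its apply method are hypothetical names, not part of the Predict4All API, and it assumes that remove counts characters at the end of the typed text.

```java
/** Hypothetical helper, not part of Predict4All: applies one prediction to the typed text. */
final class PredictionApplier {

    private PredictionApplier() {
    }

    /**
     * @param typedText text currently in the input field
     * @param insert    the "insert" value of the printed prediction
     * @param remove    the "remove" value: number of trailing characters to delete first (assumption)
     * @param space     the "space" value: whether a space is appended after the word
     */
    static String apply(String typedText, String insert, int remove, boolean space) {
        if (remove < 0 || remove > typedText.length()) {
            throw new IllegalArgumentException("remove out of range: " + remove);
        }
        // Drop the characters that the prediction replaces, then append the new text
        String kept = typedText.substring(0, typedText.length() - remove);
        return kept + insert + (space ? " " : "");
    }

    public static void main(String[] args) {
        // Values copied from the results shown above
        System.out.println(apply("je te r", "appelle", 0, true));
        System.out.println(apply("il eta", "était", 3, true));
    }
}
```

With remove = 0 the prediction only completes the current word; with remove = 3 the three typed characters "eta" are replaced by the corrected word.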

Using dynamic model

    DynamicNGramDictionary dynamicNGramDictionary = new DynamicNGramDictionary(4);
    predictionParameter.setDynamicModelEnabled(true);
    WordPredictor wordPredictor = new WordPredictor(predictionParameter, dictionary, ngramDictionary, dynamicNGramDictionary);

    // Predictions before training
    WordPredictionResult predictionResult = wordPredictor.predict("je vais à la ");
    for (WordPrediction prediction : predictionResult.getPredictions()) {
        System.out.println(prediction);
    }

    // Train the dynamic model with a user sentence, then predict again
    wordPredictor.trainDynamicModel("je vais à la gare");
    predictionResult = wordPredictor.predict("je vais à la ");
    for (WordPrediction prediction : predictionResult.getPredictions()) {
        System.out.println(prediction);
    }

Result (French language model)

    fête = 0.3670450710570904 (insert = fête, remove = 0, space = true)
    bibliothèque = 0.22412342109445696 (insert = bibliothèque, remove = 0, space = true)
    salle = 0.22398910838330122 (insert = salle, remove = 0, space = true)
    fin = 0.014600071765987328 (insert = fin, remove = 0, space = true)
    suite = 0.014315510457449597 (insert = suite, remove = 0, space = true)
    - After training
    fête = 0.35000112941797795 (insert = fête, remove = 0, space = true)
    bibliothèque = 0.2137161256141207 (insert = bibliothèque, remove = 0, space = true)
    salle = 0.213588049788271 (insert = salle, remove = 0, space = true)
    gare = 0.045754860284824 (insert = gare, remove = 0, space = true)
    fin = 0.013922109328323544 (insert = fin, remove = 0, space = true)

In this example, the word "gare" appears after training the model with "je vais à la gare".

Be careful: training a model with wrong sentences will corrupt your data.
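To give an intuition of why a trained sentence changes later predictions, here is a toy bigram counter. It is only an illustration of the general idea of learning word sequences from user text: ToyDynamicBigrams is a hypothetical class, it is not Predict4All's algorithm, and the real dynamic model (which computes probabilities over longer ngrams) is more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model, not Predict4All's algorithm: next-word counts that grow with user text. */
final class ToyDynamicBigrams {
    // previous word -> (next word -> number of times this pair was seen)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Learns every pair of adjacent words in the sentence. */
    void train(String sentence) {
        String[] words = sentence.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                    .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Returns the words seen after previousWord, most frequent first (ties in alphabetical order). */
    List<String> predict(String previousWord) {
        Map<String, Integer> next = counts.getOrDefault(
                previousWord.toLowerCase(Locale.ROOT), Collections.<String, Integer>emptyMap());
        List<String> result = new ArrayList<>(next.keySet());
        result.sort((a, b) -> {
            int byCount = Integer.compare(next.get(b), next.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return result;
    }
}
```

In this toy, a word never seen after "la" cannot be predicted until a sentence containing it is trained, and a wrong sentence is learned just as readily as a correct one, which is why training data should be checked.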

Saving the dynamic model

When using a dynamic model, you have to save and load two different files: the user ngrams and the user word dictionary.

The original files are never modified, so they can be shared across different users: this is the recommended implementation pattern.

Once your model is trained, you may want to save it:

    dynamicNGramDictionary.saveDictionary(new File("fr_user_ngrams.bin"));

and later load it again (and pass it to the WordPredictor constructor):

    DynamicNGramDictionary dynamicNGramDictionary = DynamicNGramDictionary.load(new File("fr_user_ngrams.bin"));

You can also save the word dictionary if new words have been added:

    dictionary.saveUserDictionary(new File("fr_user_words.bin"));

and later load it again (on an existing WordDictionary instance):

    dictionary.loadUserDictionary(new File("fr_user_words.bin"));

Modify vocabulary

It is sometimes useful to modify the available vocabulary to better adapt predictions to the user.

This can be done through WordDictionary. For example, you can disable a word:

    Word maisonWord = dictionary.getWord("maison");
    maisonWord.setForceInvalid(true, true);

Or you can show the user every custom word added to the dictionary:

    dictionary.getAllWords().stream()
            .filter(w -> w.isValidToBePredicted(predictionParameter)) // skip words that would never appear in predictions
            .filter(Word::isUserWord) // keep only the words added by the user
            .forEach(w -> System.out.println(w.getWord()));

When you modify words (original or added by users), don't forget to save the user dictionary: it saves user words as well as modifications to original words.

You can find further information in the Word javadoc.

Tech notes

When using Predict4All, you should take note that:

The library is not designed to be thread safe: you should synchronize your calls to WordPredictor
The library relies on disk reads: the ngram file is opened with a FileChannel, which means that your ngram data file stays open in the process as long as you are using the library
Dynamic model files depend on the original data files: if the original data changes, you may get a WordDictionaryMatchingException when loading your previous user files
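One possible way to follow the thread safety note above is to route every prediction through a single lock. The sketch below is self-contained and does not use the library: SynchronizedPredictor and ThrowingPredictor are hypothetical names, and the delegate stands in for a call such as wordPredictor.predict(text).

```java
/** Hypothetical functional interface standing in for a prediction call that may throw. */
@FunctionalInterface
interface ThrowingPredictor<R> {
    R predict(String text) throws Exception;
}

/** Hypothetical wrapper: only one thread at a time can reach the wrapped predictor. */
final class SynchronizedPredictor<R> {
    private final ThrowingPredictor<R> delegate;
    private final Object lock = new Object();

    SynchronizedPredictor(ThrowingPredictor<R> delegate) {
        this.delegate = delegate;
    }

    R predict(String text) throws Exception {
        // Calls from different threads are serialized on the same lock
        synchronized (lock) {
            return delegate.predict(text);
        }
    }
}
```

Assuming the predictor instance is only reachable through such a wrapper, a UI thread and a background thread can share it safely. Other calls on the same predictor (for example dynamic model training) would need to go through the same lock.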

Training your own language model

To train your own language model, you first need to prepare:

The runtime environment for Predict4All (JRE 1.8+) with enough RAM (the more RAM you have, the bigger the models you can create)
The training data: a directory containing .txt files encoded in UTF-8 (for better computing performance, prefer multiple txt files over a single big txt file)
Lexique: a base dictionary for the French language that you should extract somewhere on your system (download)
A training configuration file: you can use res/default/fr_default_training_configuration.json - make sure to change PATH_TO_LEXIQUE
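If your corpus is a single big file, the training data item above suggests splitting it first. Here is a minimal sketch of such a preparation step using only the JDK: CorpusSplitter, the corpus_NNNNN.txt naming and the split by line count are illustrative choices, not something the trainer requires.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

/** Hypothetical helper: splits one large UTF-8 text file into several smaller .txt files. */
final class CorpusSplitter {

    private CorpusSplitter() {
    }

    /** Writes at most linesPerFile lines per output file and returns the number of files created. */
    static int split(Path bigFile, Path outputDir, int linesPerFile) throws IOException {
        if (linesPerFile <= 0) {
            throw new IllegalArgumentException("linesPerFile must be positive");
        }
        Files.createDirectories(outputDir);
        int fileCount = 0;
        int linesInCurrent = 0;
        BufferedWriter writer = null;
        try (BufferedReader reader = Files.newBufferedReader(bigFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Start a new output file when none is open or the current one is full
                if (writer == null || linesInCurrent == linesPerFile) {
                    if (writer != null) {
                        writer.close();
                    }
                    fileCount++;
                    String name = String.format(Locale.ROOT, "corpus_%05d.txt", fileCount);
                    writer = Files.newBufferedWriter(outputDir.resolve(name), StandardCharsets.UTF_8);
                    linesInCurrent = 0;
                }
                writer.write(line);
                writer.newLine();
                linesInCurrent++;
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return fileCount;
    }
}
```

The output directory can then be given as the corpus path when training.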

A good CPU also matters: Predict4All makes heavy use of multi-threaded algorithms, so the more cores you have, the faster the training will be.

Then, you can run the executable jar (precompiled version available) from the command line:

    java -Xmx16G -jar predict4all-model-trainer-cmd-1.1.0-all.jar -config fr_training_configuration.json -language fr -ngram-dictionary fr_ngrams.bin -word-dictionary fr_words.bin path/to/corpus

This command launches a training run, allows the JVM to use up to 16 GB of memory, and specifies the input and output configuration.

The generated data files will be fr_ngrams.bin and fr_words.bin.

Alternatively, you can check LanguageDataModelTrainer in predict4all-model-trainer-cmd to launch your training programmatically.

Getting help

Please let us know if you use Predict4All!

Feel free to file an issue if you need assistance or if you find a bug.

Licence

This software is distributed under the Apache License 2.0 (see file LICENCE)

References

This project was developed using various NLP techniques (mainly ngram based).

Directly related articles

Sibylle AAC system: Tonio Wandmacher, Jean-Yves Antoine, Jean-Paul Departe, Franck Poirier. SIBYLLE, an assistive communication system adapting to the context and its user. ACM Transactions on Accessible Computing, ACM New York, NY, USA 2008, 1 (1), pp.1-30. hal-01021174
Correction needs (Dysorthography/Dyslexia users): Antoine J.Y., Crochetet M., Arbizu C., Lopez E., Pouplin S., Besnier A., Thebaud M. (2019) Ma copie adore le vélo: analyse des besoins réels en correction orthographique sur un corpus de dictées d’enfants. TALN’2019. hal-02375246v1

Inspiring papers

Techniques for automatically correcting words in text, Karen Kukich, ACM Computing Surveys, December 1992, https://doi.org/10.1145/146370.146380
On structuring probabilistic dependences in stochastic language modelling, Hermann Ney, Ute Essen, Reinhard Kneser, Computer Speech & Language, Volume 8, Issue 1, 1994, Pages 1-38, ISSN 0885-2308, https://doi.org/10.1006/csla.1994.1001
More to come...

Note for dev.

Predict4All is built using GitHub Actions to publish new versions on Maven Central.

To create a version

Build it locally and check that the tests still pass
Check the target version in build.gradle
Tag the repo with this version and push to GitHub
Run the workflow: ci-predict4all-publish
Connect to the legacy Sonatype website
Select the corresponding repository
Press Close to release or Drop to cancel
If Close is selected, wait for the validation rules to run, then select Release to deploy the version

Reference document on publishing
