ANYKS Spell-checker (ASC) C++11

Project description

There are many typo and text error correction systems out there. Each of them has its pros and cons, each has a right to exist, and each will find its own user base. I would like to present my own version of a typo correction system with its own unique features.

List of features

  • Correction of mistakes in words with a Levenshtein distance of up to 4 (see the sketch after this list);
  • Correction of different types of typos in words: insertion, deletion, substitution, and transposition of characters;
  • Ё-fication of a word given the context (the letter 'ё' is commonly replaced by the letter 'е' in typed Russian text);
  • Context-based word capitalization for proper names and titles;
  • Context-based splitting for words that are missing the separating space character;
  • Text analysis without correcting the original text;
  • Searching the text for errors, typos, and incorrect context.
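
For reference, here is a minimal sketch (not ASC's internal implementation, which is in C++) of the classic dynamic-programming Levenshtein distance that the first feature refers to:

```python
# Illustrative sketch only: edit distance between two words
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance from the current prefix of `a` to b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("мрозе", "морозе"))  # 1: a single missing letter
print(levenshtein("слзы", "слёзы"))    # 1
```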

Dependencies

Building the project

Python version ASC

$ python3 -m pip install pybind11
$ python3 -m pip install anyks-sc

Documentation pip

Cloning the project

$ git clone --recursive https://github.com/anyks/asc.git

Build third party

$ ./build_third_party.sh

Linux/MacOS X and FreeBSD

$ mkdir ./build
$ cd ./build
$ cmake ..
$ make

Ready-to-use dictionaries

| Dictionary name | Size (GB) | RAM (GB) | N-gram order | Language |
|---|---|---|---|---|
| wittenbell-3-big.asc | 1.97 | 15.6 | 3 | RU |
| wittenbell-3-middle.asc | 1.24 | 9.7 | 3 | RU |
| mkneserney-3-middle.asc | 1.33 | 9.7 | 3 | RU |
| wittenbell-3-single.asc | 0.772 | 5.14 | 3 | RU |
| wittenbell-5-single.asc | 1.37 | 10.7 | 5 | RU |
| wittenbell-5-big.asc | 9.76 | 22.1 | 5 | RU |
| wittenbell-3-single.asc | 2.33 | 6.15 | 3 | EN |

Testing

To test the system, we used data from the 2016 "spelling correction" competition organized by Dialog21.
The trained binary dictionary used for testing: wittenbell-3-middle.asc

| Mode | Precision | Recall | F-measure |
|---|---|---|---|
| Typo correction | 76.97 | 62.71 | 69.11 |
| Error correction | 73.72 | 60.53 | 66.48 |

I think it is unnecessary to add any other data. Anyone can repeat the test if they wish (all files used for testing are attached below).
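
The F-measure column above is the harmonic mean of precision and recall, F = 2PR / (P + R); a quick check of the published numbers:

```python
# Recomputing the F-measure values in the table from precision and recall
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(76.97, 62.71), 2))  # 69.11 (typo correction)
print(round(f_measure(73.72, 60.53), 2))  # 66.48 (error correction)
```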

Files used for testing


File formats

ARPA

\data\
ngram 1=52
ngram 2=68
ngram 3=15

\1-grams:
-1.807052    1-й    -0.30103
-1.807052    2    -0.30103
-1.807052    3~4    -0.30103
-2.332414    как    -0.394770
-3.185530    после    -0.311249
-3.055896    того    -0.441649
-1.150508    </s>
-99    <s>    -0.3309932
-2.112406    <unk>
-1.807052    T358    -0.30103
-1.807052    VII    -0.30103
-1.503878    Грека    -0.39794
-1.807052    Греку    -0.30103
-1.62953    Ехал    -0.30103
...

\2-grams:
-0.29431    1-й передал
-0.29431    2 ложки
-0.29431    3~4 дня
-0.8407791    <s> Ехал
-1.328447    после того    -0.477121
...

\3-grams:
-0.09521468    рак на руке
-0.166590    после того как
...

\end\
| N-gram | Occurrences in the corpus | Occurrences in documents |
|---|---|---|
| только в одном | 2 | 1 |
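
The first column of the ARPA file holds log-probabilities; assuming the standard ARPA convention of base-10 logarithms, a weight can be converted back to a probability like this:

```python
# Assumption: the weights follow the usual ARPA convention (log10 probabilities)
p = 10 ** -2.332414   # unigram weight of "как" from the example above
print(round(p, 5))     # ~0.00465
```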

Vocab

\data\
ad=1
cw=23832
unq=9390

\words:
33    а    244 | 1 | 0.010238 | 0.000000 | -3.581616
34    б    11 | 1 | 0.000462 | 0.000000 | -6.680889
35    в    762 | 1 | 0.031974 | 0.000000 | -2.442838
40    ж    12 | 1 | 0.000504 | 0.000000 | -6.593878
330344    был    47 | 1 | 0.001972 | 0.000000 | -5.228637
335190    вам    17 | 1 | 0.000713 | 0.000000 | -6.245571
335192    дам    1 | 1 | 0.000042 | 0.000000 | -9.078785
335202    нам    22 | 1 | 0.000923 | 0.000000 | -5.987742
335206    сам    7 | 1 | 0.000294 | 0.000000 | -7.132874
335207    там    29 | 1 | 0.001217 | 0.000000 | -5.711489
2282019644    похожесть    1 | 1 | 0.000042 | 0.000000 | -9.078785
2282345502    новый    10 | 1 | 0.000420 | 0.000000 | -6.776199
2282416889    белый    2 | 1 | 0.000084 | 0.000000 | -8.385637
3009239976    гражданский    1 | 1 | 0.000042 | 0.000000 | -9.078785
3009763109    банкиры    1 | 1 | 0.000042 | 0.000000 | -9.078785
3013240091    геныч    1 | 1 | 0.000042 | 0.000000 | -9.078785
3014009989    преступлениях    1 | 1 | 0.000042 | 0.000000 | -9.078785
3015727462    тысяч    2 | 1 | 0.000084 | 0.000000 | -8.385637
3025113549    позаботьтесь    1 | 1 | 0.000042 | 0.000000 | -9.078785
3049820849    комментарием    1 | 1 | 0.000042 | 0.000000 | -9.078785
3061388599    компьютерная    1 | 1 | 0.000042 | 0.000000 | -9.078785
3063804798    шаблонов    1 | 1 | 0.000042 | 0.000000 | -9.078785
3071212736    завидной    1 | 1 | 0.000042 | 0.000000 | -9.078785
3074971025    холодной    1 | 1 | 0.000042 | 0.000000 | -9.078785
3075044360    выходной    1 | 1 | 0.000042 | 0.000000 | -9.078785
3123271427    делаешь    1 | 1 | 0.000042 | 0.000000 | -9.078785
3123322362    читаешь    1 | 1 | 0.000042 | 0.000000 | -9.078785
3126399411    готовится    1 | 1 | 0.000042 | 0.000000 | -9.078785
...
| Word ID | Word | Occurrences in the corpus | Occurrences in documents | tf | tf-idf | wltf |
|---|---|---|---|---|---|---|
| 2282345502 | новый | 10 | 1 | 0.000420 | 0.000000 | -6.776199 |

Description:

  • ad - The total number of documents in the training corpus;
  • cw - The total number of words in the training corpus;
  • oc - Occurrences in the corpus;
  • dc - Occurrences in documents;
  • tf - (term frequency) the ratio of the number of occurrences of a word to the total number of words in the document; it estimates the importance of a word within a single document and is calculated as: [tf = oc / cw];
  • idf - (inverse document frequency) the inverse of the frequency with which a word occurs in the collection of documents, calculated as: [idf = log(ad / dc)];
  • tf-idf - calculated as: [tf-idf = tf * idf];
  • wltf - word rating, calculated as: [wltf = 1 + log(tf * dc)] (the snippet after this list recomputes these values for the «новый» row above).
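
A minimal sketch recomputing these metrics for the «новый» row above (oc = 10, dc = 1, ad = 1, cw = 23832); the natural logarithm reproduces the wltf value stored in the vocab file:

```python
import math

ad, cw = 1, 23832   # documents and words in the training corpus
oc, dc = 10, 1      # occurrences of "новый" in the corpus and in documents

tf = oc / cw                   # 0.000420  (term frequency)
idf = math.log(ad / dc)        # 0.000000  (inverse document frequency)
tf_idf = tf * idf              # 0.000000
wltf = 1 + math.log(tf * dc)   # ~ -6.776199 (word rating, as in the vocab file)

print(f"{tf:.6f} {idf:.6f} {tf_idf:.6f} {wltf:.6f}")
```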

File containing similar letters encountered in different dictionaries

p    р
c    с
o    о
t    т
k    к
e    е
a    а
h    н
x    х
b    в
m    м
...

| Original letter | Separator | Replacement letter |
|---|---|---|
| t | \t | т |
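
A hypothetical illustration (not part of ASC) of what such a map is for: normalizing Latin look-alike characters that slipped into a Cyrillic word:

```python
# Hypothetical helper, not an ASC API: maps Latin look-alikes to Cyrillic letters
SIMILARS = {"p": "р", "c": "с", "o": "о", "t": "т", "k": "к",
            "e": "е", "a": "а", "h": "н", "x": "х", "b": "в", "m": "м"}

def normalize(word: str) -> str:
    # Replace every Latin look-alike with its Cyrillic counterpart
    return "".join(SIMILARS.get(ch, ch) for ch in word.lower())

# "мoлoкo" typed with Latin "o" becomes a purely Cyrillic "молоко"
print(normalize("мoлoкo"))
```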

File containing a list of abbreviations

гр
США
ул
руб
рус
чел
...

All words from this list will be marked as an unknown word 〈abbr〉.

File containing a list of domain zones

ru
su
cc
net
com
org
info
...

For more accurate detection of the 〈url〉 token, we recommend adding your own domain zones (all domain zones in the example are already pre-installed).
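
A hypothetical sketch (not an ASC API) of how a domain-zone list like this can be used to decide whether a token should be treated as 〈url〉:

```python
# Hypothetical helper, not ASC internals: checks a token's top-level domain
# against the zone list described above.
DOMAIN_ZONES = {"ru", "su", "cc", "net", "com", "org", "info"}

def looks_like_url(token: str) -> bool:
    # Strip an optional scheme and path, keeping only the host part
    host = token.split("://")[-1].split("/")[0]
    # A host must contain a dot and end with a known domain zone
    return "." in host and host.rsplit(".", 1)[-1].lower() in DOMAIN_ZONES

print(looks_like_url("https://example.com/page"))  # True
print(looks_like_url("пример.рф"))                 # False: "рф" is not in this sample list
```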


Python word preprocessing script template

# -*- coding: utf-8 -*-

def init():
    """
    Initialization method: executed once at application startup
    """

def run(word, context):
    """
    Processing method: called during word extraction from text
    @word    processed word
    @context sequence of previous words in the form of an array
    """
    return word

Python custom word token definition script template

# -*- coding: utf-8 -*-

def init():
    """
    Initialization method: executed once at application startup
    """

def run(token, word):
    """
    Processing method: called during word extraction from text
    @token word token name
    @word  processed word
    """
    if token and (token == "<usa>"):
        if word and (word.lower() == "сша"):
            return "ok"
    elif token and (token == "<russia>"):
        if word and (word.lower() == "россия"):
            return "ok"
    return "no"

Python stemming script example

import spacy
import pymorphy2

# Morphological analyzers
morphRu = None
morphEn = None

def init():
    """
    Initialization method: executed once at application startup
    """
    # Get the morphological analyzers
    global morphRu
    global morphEn
    # Activate the morphological analyzer for the Russian language
    morphRu = pymorphy2.MorphAnalyzer()
    # Activate the morphological analyzer for the English language
    morphEn = spacy.load('en', disable=['parser', 'ner'])

def eng(word):
    """
    English lemmatization method
    @word word to lemmatize
    """
    # Get the morphological analyzer
    global morphEn
    # Get the morphological analyzer result
    words = morphEn(word)
    # Get the lemmatization result
    word = ''.join([token.lemma_ for token in words]).strip()
    # If the resulting word is a correct word
    if word[0] != '-' and word[len(word) - 1] != '-':
        # Return the result
        return word
    else:
        # Return an empty string
        return ""

def rus(word):
    """
    Russian lemmatization method
    @word word to lemmatize
    """
    # Get the morphological analyzer
    global morphRu
    # If the morphological analyzer exists
    if morphRu != None:
        # Get the morphological analyzer result
        word = morphRu.parse(word)[0].normal_form
        # Return the analyzer result
        return word
    else:
        # Return an empty string
        return ""

def run(word, lang):
    """
    Method that runs morphological processing
    @word word to lemmatize
    @lang alphabet name for @word
    """
    # If the word is in Russian
    if lang == "ru":
        # Return the Russian lemmatization result
        return rus(word.lower())
    # If the word is in English
    elif lang == "en":
        # Return the English lemmatization result
        return eng(word.lower())

Environment variables

  • All arguments can be passed via environment variables. Variables start with the ASC_ prefix and must be written in uppercase; apart from that, the variable names correspond to their application arguments.
  • If both application parameters and environment variables were passed, application parameters will have priority.
$ export ASC_R-ARPA=./lm.arpa
$ export ASC_R-BIN=./wittenbell-3-single.asc
  • Example of parameters in JSON format
{"debug":1,"method":"spell","spell-verbose":true,"asc-split":true,"asc-alter":true,"asc-esplit":true,"asc-rsplit":true,"asc-uppers":true,"asc-hyphen":true,"asc-wordrep":true,"r-text":"./texts/input.txt","w-text":"./texts/output.txt","r-bin":"./dict/wittenbell-3-middle.asc"}

Examples

Example of running the program

Information about the binary dictionary

$ ./asc -method info -r-bin ./dict/wittenbell-3-middle.asc

Training process

{"ext":"txt","size":3,"alter": {"е":"ё"},"debug":1,"threads":0,"method":"train","vprune":true,"allow-unk":true,"reset-unk":true,"confidence":true,"interpolate":true,"mixed-dicts":true,"only-token-words":true,"kneserney-modified":true,"kneserney-prepares":true,"vprune-wltf":-15.0,"locale":"en_US.UTF-8","smoothing":"mkneserney","pilots": ["а","у","в","о","с","к","б","и","я","э","a","i","o","e","g"],"corpus":"./texts/corpus","w-bin":"./dictionary/3-middle.asc","w-abbr":"./dict/release/lm.abbr","w-vocab":"./dict/release/lm.vocab","w-arpa":"./dict/release/lm.arpa","abbrs":"./texts/abbrs/abbrs.txt","goodwords":"./texts/whitelist/words.txt","badwords":"./texts/blacklist/words.txt","alters":"./texts/alters/yoficator.txt","upwords":"./texts/words/names/words","mix-restwords":"./texts/similars/letters.txt","alphabet":"абвгдеёжзийклмнопрстуфхцчшщъыьэюяabcdefghijklmnopqrstuvwxyz","bin-code":"ru","bin-name":"Russian","bin-author":"You name","bin-copyright":"You company LLC","bin-contacts":"site: https://example.com, e-mail: info@example.com","bin-lictype":"MIT","bin-lictext":"... License text ...","embedding-size":28,"embedding": {"а":0,"б":1,"в":2,"г":3,"д":4,"е":5,"ё":5,"ж":6,"з":7,"и":8,"й":8,"к":9,"л":10,"м":11,"н":12,"о":0,"п":13,"р":14,"с":15,"т":16,"у":17,"ф":18,"х":19,"ц":20,"ч":21,"ш":21,"щ":21,"ъ":22,"ы":23,"ь":22,"э":5,"ю":24,"я":25,"<":26,">":26,"~":26,"-":26,"+":26,"=":26,"*":26,"/":26,":":26,"%":26,"|":26,"^":26,"&":26,"#":26,"'":26,"\\":26,"0":27,"1":27,"2":27,"3":27,"4":27,"5":27,"6":27,"7":27,"8":27,"9":27,"a":0,"b":2,"c":15,"d":4,"e":5,"f":18,"g":3,"h":12,"i":8,"j":6,"k":9,"l":10,"m":11,"n":12,"o":0,"p":14,"q":13,"r":14,"s":15,"t":16,"u":24,"v":21,"w":22,"x":19,"y":17,"z":7  }}
$ ./asc -r-json ./train.json
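
The "embedding" block in the training config above maps every character of the alphabet to a small class index, and confusable letters deliberately share a class (for example «а», «о», Latin "a" and "o" all map to 0, while «е», «ё», «э» and Latin "e" map to 5). A hypothetical illustration (not ASC internals) of what this collapsing does to two spellings of the same word:

```python
# Subset of the "embedding" table from the training config above
EMBEDDING = {"а": 0, "е": 5, "ё": 5, "к": 9, "л": 10}

def encode(word: str) -> list:
    # Map each character to its class index
    return [EMBEDDING[ch] for ch in word.lower()]

print(encode("ёлка"))  # [5, 10, 9, 0]
print(encode("елка"))  # [5, 10, 9, 0] -- identical despite the "е"/"ё" difference
```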

Error correction

Reading text from file -> correction -> writing corrected text to a new file

$ ./asc -debug 1 -method spell -spell-verbose -asc-split -asc-alter -asc-esplit -asc-rsplit -asc-uppers -asc-hyphen -asc-wordrep -r-text ./texts/input.txt -w-text ./texts/output.txt -r-bin ./dict/wittenbell-3-middle.asc

Reading from stream -> correction -> output to stream

$echo"слзы теут на мрозе"| ./asc -debug 1 -method spell -spell-verbose -asc-split -asc-alter -asc-esplit -asc-rsplit -asc-uppers -asc-hyphen -asc-wordrep -r-bin ./dict/wittenbell-3-middle.asc

Running in the interactive mode

$ ./asc -debug 1 -method spell -spell-verbose -asc-split -asc-alter -asc-esplit -asc-rsplit -asc-uppers -asc-hyphen -asc-wordrep -interactive -r-bin ./dict/wittenbell-3-middle.asc

Working with files using JSON template

{"debug":1,"method":"spell","spell-verbose":true,"asc-split":true,"asc-alter":true,"asc-esplit":true,"asc-rsplit":true,"asc-uppers":true,"asc-hyphen":true,"asc-wordrep":true,"r-text":"./texts/input.txt","w-text":"./texts/output.txt","r-bin":"./dict/wittenbell-3-middle.asc"}
$ ./asc -r-json ./spell.json

License

MIT License

The software is licensed under the MIT License:

Copyright © 2020 Yuriy Lobarev

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Contact Info

If you have questions regarding the library, I would like to invite you to open an issue on GitHub. Please describe your request, problem, or question in as much detail as possible, and also mention the version of the library you are using, as well as the versions of your compiler and operating system. Opening an issue on GitHub allows other users and contributors of this library to collaborate.


Yuriy Lobarev <forman@anyks.com>

