An Esperanto dictionary, compiled by Sergio Pokrovskij for the version 3 of ispell.
pok49/ispell-eo
- Name: ./README
- Content: Information about Esperanto dictionary for the Ispell speller
- Created: 2024-01-19 by Sergio Pokrovskij <sergio.pokrovskij(cxe)gmail.com>
- Version: 4.3
Copyright © 1997, 1998, 2003, 2006, 2008, 2024 by Sergio Pokrovskij
This dictionary package is available under the terms of the GNU General Public License version 2.0 (Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA).
- About the Package
- Quick Install
- For the Ispell Utility Program
- Installation from Scratch
- Usage
- Apostrophes
Here is an Esperanto spelling dictionary, compiled by Sergio Pokrovskij for versions 3+ of Ispell.
The dictionary can also be converted into an eo.utf-8.spl file for use with the Vim Speller (see :help spell in Vim itself).
Some more information is available in Esperanto.
You will need Ispell 3.0+. A precompiled package is available in most Linux repositories, and the source tarball can be downloaded from its official site. If you compile it from the sources, make sure that in its local.h the NO8BIT definition is commented out and MASKBITS = 64. To see the options ispell has been compiled with, please run
$ ispell -vv | grep MASKBITS
The result should be:
MASKBITS = 64
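The same check can be scripted. The following is a minimal Python sketch, assuming only that ispell -vv prints a line of the form MASKBITS = 64 as shown above; the helper names maskbits and installed_maskbits are hypothetical and not part of this package.

```python
import re
import subprocess


def maskbits(vv_output):
    """Return the MASKBITS value found in `ispell -vv` output, or None."""
    match = re.search(r"^\s*MASKBITS\s*=\s*(\d+)", vv_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def installed_maskbits():
    """Run `ispell -vv` (ispell must be on PATH) and extract MASKBITS."""
    result = subprocess.run(["ispell", "-vv"], capture_output=True, text=True)
    return maskbits(result.stdout)


if __name__ == "__main__":
    # Parsing a captured sample instead of calling ispell itself.
    sample = 'LIBDIR = "/usr/lib/ispell"\nMASKBITS = 64\n'
    print(maskbits(sample))
```

A value other than 64 (or no value at all) would suggest that the installed ispell was built with different options than this dictionary expects.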
Here is a binary dictionary eo.hash, made for Linux x86_64. It should be placed where Ispell expects to find it; in my case it is
$ ispell -vv | grep LIBDIR
LIBDIR = "/usr/lib/ispell"
You can test your installation with an ASCII transliteration on any terminal:
$ echo "Kuba harpisto sxajnis amuzigxi facilege cxe via jxauxda hxoro" |\
ispell -T cxirkaux -d eo
@(#) International Ispell Version 3.4.02 08 Jan 2021
word: ok
ok (derives from root HARPI)
ok (derives from root ^SAJNI)
ok (derives from root AMUZI)
ok (derives from root FACILEGA)
ok
ok (derives from root VI)
ok
ok
word:
or by Unicode representation, if your terminal works in UTF-8:
$ echo "Ŝajnas ke sagaca monaĥo laŭtvoĉe refuzadis pregi syr herbaĵo" |\
ispell -P -T utf8 -d eo
@(#) International Ispell Version 3.4.02 08 Jan 2021
word: ok (derives from root ^SAJNI)
ok
ok
ok
ok (derives from root LA^UTVO^CA)
how about: refutadis, reuzadis, rifuzadis
how about: predi, preĝi, premi, preni, presi, preti, puregi, regi
how about: sur
ok (derives from root HERBO)
word:
Also see the section “Usage” below.
If the binary eo.hash proves incompatible with your system, you can build it yourself from eo.aff and eo.asc:
- Download the files.
- Use the buildhash utility from the ispell distribution:
$ buildhash eo.asc eo.aff eo.hash
- You can now try the new eo.hash and proceed back to “The Binary File”.
There is an obsolete vimspell plugin for the ubiquitous Vim editor, which provides an interface to ispell and aspell. You should be able to use the ispell-eo dictionaries via this interface (I did not test it). Yet it is no longer supported:
note to VIM 7 users !
Version 7 of vim integrates a native spellchecker which outperforms vimspell script. As such I will not maintain anymore vimspell script, and you are advised to delete all related files from your plugin/ and doc/ directory, and use the native spellchecker instead.
I have written a converter from my Ispell source dictionary to Vim's Spell format (which is basically the MySpell format).
The ready-made packed dictionary is available for download as eo.utf-8.spl. You can install it locally for yourself in $HOME/.vim/spell, or system-wide in a spell or after subdirectory of Vim's runtimepath variable. (To examine its value say :echo &runtimepath).
To enable Esperanto spell checking, say :setlocal spell spelllang=eo. For help on the speller commands see :h spell (or see the same documentation on the Web).
Several encodings used with Esperanto text are supported:
- The very best is Unicode or a subset of it that contains the Esperantic letters, like Microsoft's WGL4 or, better, ISO's MES-1 or MES-2; you can use it with xterm or Emacs under Unix, or with UniRed under Windows. Unicode is available in the UTF-8 encoding, which is preferred on Unix systems.
- The second best choice used to be the Latin‑3 encoding (ISO-8859‑3); it is obsolete by now.
- For the sake of the ASCII-impaired (and ANSI-impaired), there are two surrogates:
  - The TeX-like ^cirka^u-style: e^ho^san^go ^ciu^ja^ude. Presently this is used as the reference representation, mainly because it is unambiguous (cf. names like Michaux); and
  - The popular cxirkaux-style, which is also convenient for lexicographical ordering and thus is used in the dictionaries; besides, it uses ASCII letters only, and that makes it suitable for various names in computer programs.
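To make the two ASCII surrogates concrete, here is a small Python sketch that maps either notation to the accented Unicode letters. It is only an illustration of the conventions described above, not code from this package; the function names from_cxirkaux and from_tex are hypothetical.

```python
import re

# Base letter -> accented letter (circumflex on c, g, h, j, s; breve on u).
ACCENTED = dict(zip("cghjsuCGHJSU", "ĉĝĥĵŝŭĈĜĤĴŜŬ"))


def from_cxirkaux(text):
    """Replace x-style digraphs (cx, gx, hx, jx, sx, ux) with accented letters."""
    return re.sub(r"([cghjsuCGHJSU])[xX]", lambda m: ACCENTED[m.group(1)], text)


def from_tex(text):
    """Replace TeX-like sequences (^c, ^g, ^h, ^j, ^s, ^u) with accented letters."""
    return re.sub(r"\^([cghjsuCGHJSU])", lambda m: ACCENTED[m.group(1)], text)


if __name__ == "__main__":
    print(from_cxirkaux("ehxosxangxo cxiujxauxde"))
    print(from_tex("e^ho^san^go ^ciu^ja^ude"))
    # The x-style is ambiguous for foreign names: the final "ux" is converted too.
    print(from_cxirkaux("Michaux"))
```

The last call shows why the TeX-like style serves as the unambiguous reference: a plain name such as Michaux is altered by the x-style conversion, while it contains no ^ sequence for the TeX-style conversion to touch.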
- Clone the ispell-eo project into your local repository:
$ git clone https://github.com/pok49/ispell-eo
- Go to the root directory ispell-eo (where this readme resides).
- Say
$ make first
(or simply make) in order to check your ispell program. Examine the output, e.g. do you have the permissions to write the hash file(s) at the install phase?
- If everything is OK, say
$ make eo
(to build the strict dictionary), or
$ make esperanto
(to build a permissive dictionary), or
$ make all
(to build both). You may get a few warnings from buildhash, like this one:
eo.aff line 218: Flag must be alphabetic
Just ignore them.
- Type
$ make install
to copy the hash file(s) to where Ispell expects them to be (probably you already have your american.hash there; normally you will need root rights to make install).
After that you can call, for instance,
$ ispell -d eo -T cxirkaux $HOME/Git/ispell-eo/doc/ekz.cx
(ekz.cx is an ASCII file, in which the Esperantic letters are presented in the cxirkaux surrogate, as the -T cxirkaux argument states; this ASCII interface should work on any terminal).
If the prefabricated eo.utf-8.spl dictionary does not work for you, you could try to pack it on your computer from the files eo_l3.aff and eo_l3.dic in the ispell-eo/oo/vimspell subdirectory:
$ cd $HOME/Git/ispell-eo/vimspell
$ env LANG=eo.utf-8 vim -u NONE -e -c "mkspell! $HOME/.vim/spell/eo eo" -c q 2>&1 > err
This assumes that you have installed the Esperantic eo.utf-8 locale on your system (available in most Linux distros); it should produce eo.utf-8.spl in the $HOME/.vim/spell/ directory for your private use.
If you prefer a different composition, you can make the dictionary yourself. You'll need Emacs to produce the dictionary for MySpell (which used to work with OpenOffice and which remains the basis of Vim's Spell). cd into $HOME/Git/ispell-eo; customize the word provision as described in “Customized Build” (except the buildhash and the following steps). In the Makefile check its vim_spl_install_dir variable; by default it is set for a local install in your $(HOME)/.vim/spell directory; you may prefer to set it globally for a system-wide install. Then say
$ make vim
and
$ make install_vim
(the latter may require the administrative rights for a system-wide install).
To see Vim Speller in action please open the test file
$ vim $HOME/Git/ispell-eo/doc/vim-test.u8
and say
:set spell spelllang=eo
In order to enable selective construction of dictionaries, some entries in the source dictionary ./src/vortoj.l3 are marked with keywords indicating the special field they belong to:
- #arhx: archaic words, like ‹ĥina› (= ‹ĉina›) or ‹malkompreni› (= ‹miskompreni›)
- #bot: a rare botanic word
- #Eujo: vocabulary of the Esperanto Movement (of ‹Esperantujo›)
- #etn: ethnography; actually also countries and other geography
- #his: history
- #komp: some computer-science terminology according to the Komputada Leksikono
- #mav: redundant words, which are used by some esperantists, though they are less precise and unnecessarily complicate the language; e.g. ‹olda› (‹maljuna› or ‹malnova›), ‹mava› (= ‹malbona›)
- #mit: mythology, religion
- #pers: given names and names of important personalities (e.g. ‹Petro›, ‹Ŝekspiro› …)
- #pok: the words specific to my idiolect
- #rar: rare words which may coincide with a misspelling of a more frequent word; e.g. ‹ajuna›, ‹komanditi›, ‹liona›
- #var: variant which I do not use but which is frequent enough (e.g. ‹kemio›, ‹tekniko› opposed to ‹ĥemio› and ‹teĥniko›)
- #zoo: uncommon zoological word
You can grep for a mark, e.g.
$ grep '#mav' ./src/vortoj.l3 | less
in order to see if you feel like me about them; you can either remove all of them from the target dictionary, or remove the #mav mark from those you do use and like; the default setting in the ./Makefile is
short_list = komp,etn,Eujo,mll
pok_list = $(short_list),bot,fremd,his,pok,pers,var,zoo
eo_list = $(short_list),drv
esperanto_list = $(short_list),arhx,mav,rar
Unless included in the custom list (like eo_list), a marked word is considered as a special one and is excluded from the build.
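The selection rule can be pictured with a short Python sketch. It is not the actual build logic of the Makefile: the sample entries and their layout (a word followed by a #mark) are assumptions made for illustration, as is the rule that an entry is kept only if every mark on it is in the list; the names keep_entry and select are hypothetical.

```python
import re


def keep_entry(line, allowed):
    """Keep an entry if it carries no mark, or only marks from the allowed set."""
    marks = set(re.findall(r"#(\w+)", line))
    return marks <= allowed


def select(lines, list_value):
    """Filter entries against a comma-separated list such as eo_list."""
    allowed = set(list_value.split(","))
    return [line for line in lines if keep_entry(line, allowed)]


if __name__ == "__main__":
    # Hypothetical entries; the real format of ./src/vortoj.l3 may differ.
    entries = ["domo", "olda #mav", "kemio #var", "retumilo #komp"]
    strict = "komp,etn,Eujo,mll,drv"                 # eo_list, expanded
    permissive = "komp,etn,Eujo,mll,arhx,mav,rar"    # esperanto_list, expanded
    print(select(entries, strict))
    print(select(entries, permissive))
```

With these sample entries the strict list keeps only the unmarked word and the #komp term, while the permissive list additionally admits the #mav word.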
When preparing a dictionary for the Vim Speller it is advisable to retain the entries marked with #mav and #rar: in the Vim Spell dictionary they will receive the qualifications BAD and RAR and as such will be warned about in an appropriate manner.
One and the same Ispell dictionary, e.g. eo.hash, can be used with several input representations, specified in its affix file (e.g. eo.aff). Each such representation can be identified by a name (used in the ispell invocation as the ‘‑T identifier’ argument), or by the extension of the argument filename; both kinds of identification are specified in the affix file. In case of an identification conflict the name argument takes precedence.
eo.aff defines the following representations:
- tex (the extensions are .tex or .bib) is suited for TeX, and imitates the dead keys: e^ho^san^g^' ^ciu^ja^de. It is the representation for which ispell-eo was originally designed (in connection with the “Komputika Leksikono”), and it remains the basic representation used in ispell-eo internally.
- cxirkaux (or .cx or .t) identifies the x‑style representation, which is the most popular ASCIIization of the Esperanto letters: ehxosxangx' cxiujxauxde.
- latin3 (or .l3) is the straightforward application of ISO 8859‑3 (aka Latin‑3), which gives all the accented Esperantic letters their canonical form; the apostrophe is represented as the ASCII ' (0x27).
- epo (or .la3 or .wiki) is like latin3, except that the apostrophe is represented by ‹´› (0xB4, spacing acute; see below the section “Apostrophes”); epo is the standard 3-letter designation of Esperanto in ISO 639.
- In utf8 (.html, .u8, .utf) the accented letters are coded by 2 bytes each according to the UTF-8 encoding; the apostrophe is encoded as ‹ʼ› (U+02BC, #xCA #xBC, modifier letter apostrophe).
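The identification rule (an explicit ‑T name wins over the filename extension) can be summarized in a few lines of Python. This is only a model of the behaviour described above, built from the extension list given for eo.aff; the function name pick_representation and the fallback to tex when nothing matches are assumptions of this sketch.

```python
import os

# Representation name -> filename extensions, as listed for eo.aff above.
EXTENSIONS = {
    "tex": (".tex", ".bib"),
    "cxirkaux": (".cx", ".t"),
    "latin3": (".l3",),
    "epo": (".la3", ".wiki"),
    "utf8": (".html", ".u8", ".utf"),
}


def pick_representation(filename, t_option=None, default="tex"):
    """Return the representation chosen for a file.

    An explicit -T name takes precedence; otherwise the extension decides.
    The `default` fallback is an assumption of this sketch.
    """
    if t_option is not None:
        return t_option
    extension = os.path.splitext(filename)[1]
    for name, extensions in EXTENSIONS.items():
        if extension in extensions:
            return name
    return default


if __name__ == "__main__":
    print(pick_representation("ekz.cx"))
    print(pick_representation("testo.u8"))
    print(pick_representation("ekz.cx", t_option="tex"))
```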
Unlike Aspell or Hunspell, Ispell allows switching among these representations (via the -T flag: ‑T tex, or ‑T utf8, etc.) while using the same hash file; this is an advantage of Ispell. On the other hand, variation in the word provision (e.g. inclusion or exclusion of the “bad” words) requires compilation of separate hash files (in our case, eo.hash vs esperanto.hash); here Hunspell and Vim Speller are more flexible: they make it possible to retain the bad words and mark their use in a special way.
Some usage examples below are illustrated with specimen files from the ispell-eo/doc directory (which in my case is in my local $HOME/Git/ repository). This should give you an idea about where and what kind of files could be used in a given situation.
You can use Ispell in a stand-alone mode, as a console program. The usage depends on the encodings available in your terminal emulator for representing the Esperanto letters.
ASCII input is available anywhere: you can use the tex (TeX) or the cxirkaux representation. In the tex representation the word ĉirkaŭ takes the form ^cirka^u; in the latter both Esperantic accents are expressed with the letter x, giving cxirkaux. To check a file with the Ispell dialog editor simply type (for the x‑notation):
$ ispell -d eo -T cxirkaux $HOME/Git/ispell-eo/doc/ekz.cx
or (for the TeX notation):
$ ispell -d eo -T tex $HOME/Git/ispell-eo/doc/ekz.^c
The resulting dialog is self-explanatory.
You can also request a list of misspelt words, e.g.
$ cat -b $HOME/Git/ispell-eo/doc/ekz.cx
     1  Por ke la linguo intrenacia povu bone kaj regule progresadi kaj por ke
     2  gxi havu plenan cetrecon, ke gxi neniam disfalos kaj ia facxilanima
     3  pasxo de gxiaj amikoj estontaj ne detruos la laborojn de gxiaj amikoj
     4  estitaj, -- estas plej necesa antaux cxio unu kondico: la ekzistado de
     5  klare difinita, neniam tusxebla kaj neniam sxangxebla Fundamento ...
     6  en nova form' eksonis nova kant'
$ ispell -d eo -T cxirkaux -l < $HOME/Git/ispell-eo/doc/ekz.cx
linguo
intrenacia
cetrecon
facxilanima
estitaj
kondico
Unfortunately, the Ispell editor is unaware of multibyte characters.
Ispell is two decades older than Unicode; yet it is possible to use its general specification facilities to define the UTF-8 encoding of the Esperantic letters, and it partially works.
Modern Linux terminals use the UTF-8 encoding by default, so you can say there:
$ ispell -d eo -T utf8 $HOME/Git/ispell-eo/doc/testo.u8
or submit test words in the command line:
$ echo "faĉilanima paŝo de ĝiaj anikoj estitaj" | ispell -T utf8 -d eo
@(#) International Ispell Version 3.4.02 08 Jan 2021
word: how about: facilanima
ok (derives from root PA^SI)
ok
ok (derives from root ^GI)
how about: amikoj, anigoj, aniĝoj, animoj, aninoj, anizoj, manikoj, panikoj, unikoj
how about: estigaj, estiĝaj, estimaj, estintaj, estritaj, festitaj, ostitaj, testitaj, vestitaj, esti+taj, estu-u+itaj
word:
(The suggestions are presented in the TeX notation.)
You can also get a list of all misspelled or unknown words from a text:
$ head -9 $HOME/Git/ispell-eo/doc/Cart.u8
AL Sinjoro fruictier.
Kara Amiko!
Vi petis, ke mi prezentu vian libron al Esperantistaro.
Prezentadon ĝi ne bezonas: ne sole ĉar la ĉefredaktoro
de »Lingvo Internacia« jam elmontris sian valoron, sed
precipe ĉar lia verko estas kunmetita laŭ principoj
de severega metodo. Eĵektive, kio eslas farita per Scienco
kaj en ĝia nomo, tio tute ne bezonas patronadon, ĉar
ne ekzistas pli atta.
$ head -9 $HOME/Git/ispell-eo/doc/Cart.u8 | ispell -d eo -T utf8 -l
fruictier
Eĵektive
eslas
atta
(The file Cart.u8 is produced by OCR of a letter by Th. Cart to Paul Fruictier, published as a foreword to “Esperanta Sintakso” by the latter. The letter is printed in italics.)
Such a short file is more conveniently corrected in a text editor via its speller interface (see “Emacs” below); but when editing a large file, e.g. a scan of a book such as Historio de Mondolingvo by E. Drezen, it may be advantageous to get a list of the most numerous errors; in Unix this can be done with a one-liner:
$ ispell -H -l -d eo -T utf8 < Drezen.html | sort -if | uniq -c | sort -nr | head -12
    215 ciuj
    209 tin
    185 gi
    168 lau
    167 ankau
    156 ce
    143 au
    127 ec
    120 in
    120 autoro
    112 Paris
    109 Volapiik
(The first column indicates the number of occurrences of the error.) With such a list one can correct dozens or hundreds of errors with a single command.
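For readers who prefer a script to a pipeline, the tally and the follow-up bulk correction can be sketched in Python. This is an illustrative sketch only: the function names most_frequent and apply_fixes are hypothetical, and the correction table in the example is an assumption about what the OCR errors stand for, to be reviewed by a human before use.

```python
import re
from collections import Counter


def most_frequent(words, limit=12):
    """Count the words reported by `ispell -l` and return the most frequent."""
    counts = Counter(word for word in words if word)
    return counts.most_common(limit)


def apply_fixes(text, fixes):
    """Replace whole words according to a {wrong: right} table."""
    if not fixes:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, fixes)) + r")\b")
    return pattern.sub(lambda m: fixes[m.group(1)], text)


if __name__ == "__main__":
    reported = ["ciuj", "gi", "ciuj", "lau", "ciuj", "gi"]
    print(most_frequent(reported, limit=2))
    # Hypothetical correction table for accents lost in OCR.
    print(apply_fixes("ciuj lau ciuj", {"ciuj": "ĉiuj", "lau": "laŭ"}))
```

Whole-word matching matters here: a short error such as in must not be replaced inside longer, correct words.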
You may safely skip this section, unless you are interested in the history of computing or have to use software which accepts only single-byte encodings.
The ISO 8859‑3 encoding, aka Latin‑3, is now rarely used; yet it is for such a single-byte encoding that Ispell was developed. Presently a Latin‑3 terminal is not readily available; one could install xterm and its luit package, and then either launch xterm in Latin‑3:
$ xterm -en 'ISO 8859-3' &
or use luit as a filter:
$ luit -encoding 'ISO 8859-3' echo eĥoŝanĝo | od -c
0000000   e 266   o 376   a   n 370   o  \n
0000011
(the echo|od commands attest that the non-ASCII letters are encoded in Latin‑3).
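The byte values in that dump can be cross-checked without a Latin‑3 terminal. The sketch below uses Python's standard iso8859_3 codec; the helper name octal_dump is hypothetical and only mimics the way od -c prints non-ASCII bytes.

```python
def octal_dump(data):
    """Show ASCII bytes as characters and all other bytes in octal, like od -c."""
    return " ".join(chr(b) if b < 0x80 else format(b, "o") for b in data)


word = "eĥoŝanĝo"
latin3 = word.encode("iso8859_3")  # one byte per letter
utf8 = word.encode("utf-8")        # two bytes per accented letter

if __name__ == "__main__":
    print(octal_dump(latin3))
    print(len(latin3), len(utf8))
```

The three accented letters account for the difference in length: 8 bytes in Latin‑3 against 11 bytes in UTF-8.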
In this environment the Ispell dialog editor should work as expected:
$ ispell -d eo -T latin3 $HOME/Git/ispell-eo/doc/testo.l3
(the file testo.l3 is written in the Latin‑3 encoding).
Normally I use Ispell in an Emacs session. Emacs is distributed with the ispell.el package, which provides an interface to the ispell processes (see “InteractiveSpell” in the Emacs Wiki). This package includes, among others, specifications for interaction with the permissive esperanto dictionary in two representations: latin3 and tex; in ispell.el they are named esperanto and esperanto-tex, respectively.
Don't be afraid of the name latin3: your text may be (and normally is) in Unicode; the program seamlessly converts your UTF-8 words to Latin‑3 and back, using Latin‑3 behind the scenes, so that you never notice it. The only exception is the limited repertoire of characters available for word representation: e.g. you cannot use the curly apostrophe, which is absent from Latin‑3 and thus cannot be passed to ispell.
As mentioned earlier, the Esperantic Ispell dictionaries can accept UTF-8 input; alas, owing to some bugs in ispell.el, conversions from the integer Unicode numbers to multibyte UTF-8 and back “may result in the evil misalignment error”; the interaction with 1-byte codes (e.g. Latin‑3) is more stable.
In my practical work I prefer the stricter eo.hash dictionary (and the x‑style ASCIIization); both are made available via the ./emacs/ispell-ini.el customization included in this distribution. You may copy it into your site-lisp (or somewhere else on your Emacs load-path), and put this into your .emacs:
(load "ispell-ini.el")
ispell-ini.el provides access to the hash dictionaries via the names of the Esperanto representations it defines:
eo : latin3(eo.hash)
epo : epo(eo.hash)
eo-x : cxirkaux(eo.hash)
esperanto-x : cxirkaux(esperanto.hash)
The representation names latin3, epo, cxirkaux are described above in “Esperanto encodings”. epo is basically latin3 extended with special care for curly apostrophes; it enables the use (and checking) of Unicode-coded texts, even though the stable ispell representation functions in the single-byte Latin‑3 encoding. This solution is presented in the next section.
In English, apostrophes appear either inside a word, as in isn't (and this case is addressed by the boundarychars specification), or after a well-formed word, as in for goodness' sake (where the exclusion of the apostrophe does not raise a false error report). Confusions with quotes are infrequent, yet possible:
$ echo \'Tis the season to be jolly! | ispell | head -2
@(#) International Ispell Version 3.4.02 08 Jan 2021
word: how about: Dis, His, Is, Its, Otis, Pis, Sis, T's, TAs, This, Ti, Ti's, Tia, Tic, Tics, Tie, Ties, Tim, Tims, Tin, Tins, Tip, Tips, Tit, Tits, Ti s, Ti-s, Ts, TVs, T is, T-is, Vis
In Esperanto apostrophes appear mainly as the last of the wordʼs characters, i.e. in a most error-prone position:
en nova form' eksonis nova kant'
Ispell and Aspell are able to treat such postfix apostrophes; Hunspell and Vim Speller take account of non-letter characters (e.g. - or ') only inside a word, when they occur between two letters.
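The difference between the two policies can be modelled with two regular expressions. This Python sketch is a rough model of the word-boundary rules just described, not the tokenizer of any of these programs; the names INSIDE_ONLY and TRAILING_TOO are hypothetical.

```python
import re

# Apostrophe accepted only between two letters (the policy described above
# for Hunspell and Vim Speller).
INSIDE_ONLY = re.compile(r"\w+(?:'\w+)*")

# Apostrophe also accepted as the last character of a word (the policy that
# Ispell and Aspell can be configured for).
TRAILING_TOO = re.compile(r"\w+(?:'\w+)*'?")

if __name__ == "__main__":
    verse = "en nova form' eksonis nova kant'"
    print(INSIDE_ONLY.findall(verse))
    print(TRAILING_TOO.findall(verse))
    print(INSIDE_ONLY.findall("isn't"))
```

Under the first policy the elided forms reach the dictionary as form and kant, so the speller never sees the apostrophe that marks the elision.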
Actually, fine typography requires a curly apostrophe, and Unicode offers two options:
- the modifier letter apostrophe ‹ʼ› U+02BC, and
- the right single quotation mark ‹’› U+2019.
The letter apostrophe is classified by Unicode as a letter, and as such perfectly suits the Esperantic spelling dictionaries for Hunspell and Vim Speller; hence the Hunspell spelling dictionary for Esperanto, distributed with LibreOffice, as well as my conversion for Vim Speller both cannot but opt for U+02BC.
Unfortunately the impact of the English (or rather Microsoft's) tradition imposes the use of U+2019 (e.g. a great many fonts which follow Microsoftʼs WGL4 standard have U+2019 and lack U+02BC). Also the Unicode® Standard 15.1.0 (2023 Sept. 12), Chapter 6, supports this confusion:
An implementation cannot assume that users’ text always adheres to the distinction between these characters. The text may come from different sources, including mapping from other character sets that do not make this distinction between the letter apostrophe and the punctuation apostrophe/right single quotation mark. In that case, all of them will generally be represented by U+2019.
The semantics of U+2019 are therefore context dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.
This Wrong Thing works for English and French; it fails for Esperanto and other languages where apostrophes may behave as a word element at the word boundary; i.e. the ISO standard is not international enough.
In an Ispell specification any character may be declared a letter; thus the eo, esperanto, eo-x, esperanto-x representations use the ASCII apostrophe ‹'›; the utf8 representation (unavailable via the ispell.el interface) uses U+2019 (this can be changed to U+02BC by replacing one line in the affix file). Unfortunately the interaction between the utf8 representation in Ispell and ispell.el remains unstable because of “the evil misalignment error”.
As a workaround, alongside the traditional latin3 representation, which uses the ASCII apostrophe, there is the epo representation in Ispell, in which the ASCII apostrophe is ignored (and available for any non-lexical usage); the Esperantic letter apostrophe is represented by the otherwise unused Latin‑3 character ‹´› (spacing acute, 0xB4). Now, the attached ispell-ini.el file advises the interface functions ispell-send-string and ispell-parse-output in such a way that in the input string sent to the Ispell process the curly apostrophes (either U+2019 or U+02BC) are replaced with 0xB4; and in Ispellʼs output this character (if any) is recoded back to the “canonical” apostrophe representation, specified by the ispell-apostrophe Elisp variable. Its default value is U+2019 (right quote), but it can be toggled to U+02BC and back by the interactive Elisp function
M-x ispell-set-apostrophe
When given a numeric prefix, this function can also set the variableunconditionally: with 1 it is set to the letter apostrophe ‹ʼ›; with2, to the right quotation mark ‹’›.
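The recoding performed around the Ispell process can be illustrated outside Emacs. The Python sketch below is not the Elisp code of ispell-ini.el; it only mirrors the round trip described above (curly apostrophes to 0xB4 on the way in, 0xB4 back to the chosen apostrophe on the way out), and the function names to_ispell and from_ispell are hypothetical.

```python
CURLY = ("\u2019", "\u02bc")  # right single quotation mark, modifier letter apostrophe
ACUTE = "\u00b4"              # spacing acute, byte 0xB4 in Latin-3


def to_ispell(text):
    """Recode text for the epo representation: curly apostrophes become 0xB4."""
    for apostrophe in CURLY:
        text = text.replace(apostrophe, ACUTE)
    return text.encode("iso8859_3")


def from_ispell(data, apostrophe="\u2019"):
    """Decode Ispell output, turning 0xB4 back into the chosen apostrophe."""
    return data.decode("iso8859_3").replace(ACUTE, apostrophe)


if __name__ == "__main__":
    sent = to_ispell("en nova form\u2019 eksonis nova kant\u02bc")
    print(sent)
    print(from_ispell(sent))                        # default: U+2019
    print(from_ispell(sent, apostrophe="\u02bc"))   # letter apostrophe
```

Note that the ASCII apostrophe passes through untouched, which is what leaves it free for non-lexical uses in the epo representation.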
In informal writing one usually would prefer the easier ASCII apostrophes, and apply the eo spellcheck; epo is appropriate when one is preparing a text of typographic quality, or when the ASCII apostrophe is used for some extra-lingual purpose, as in Wikipedia sources.
Take for example the source text of the article «Majusklo» in the Esperanto Wikipedia (triple apostrophes are the boldface mark):
'''Majusklo''' (aŭ '''grandlitero''' aŭ '''ĉeflitero''')
estas unu el du formoj, kiujn povas havi ĉiu litero ...
The eo spellcheck would complain about the nonexistent words '''Majusklo''', '''grandlitero''', '''ĉeflitero'''.
Please note that such a nuisance does not occur in English or French, where the apostrophes are recognized only within a word, between two letters. On the one hand this is a language-dependent feature, which is not equally convenient internationally (but was Wikipedia conceived as an international project?). On the other hand, it may be useful to promote the typographically preferable form of the apostrophes.
ispell.el provides the ispell-change-dictionary function, bound to C-c i c, e.g.
- C-c i c RET epo RET selects the epo “dictionary” with ‹’›;
- C-c i c RET eo RET selects the eo “dictionary” with ‹'›.
It can also be invoked from the menu:
Tools → Spell Checking → Change Dictionary ...
ispell-ini.el adds a few commands for easier switching:
- C-c i 3 sets the Esperanto eo (latin3) dictionary
- C-u C-c i 3 sets the Esperanto epo dictionary
- C-c i x sets the Esperanto eo-x dictionary
- C-u C-c i x sets the esperanto-x dictionary
- C-c i a sets the American English dictionary
- C-c i f sets the French dictionary
- C-c i p sets the Russian dictionary.
In order to input curly apostrophes ispell.el provides the command
C-c i '
It inserts U+2019 or U+02BC, according to the current value of ispell-apostrophe.
The quotation marks can be input pairwise, and the cursor is positioned in between; if there is an active region, the quotation marks are put around it. The commands (shortcuts) are:
- C-c " makes a pair of guillemets «│»
- C-c 9 makes a 99—66 pair: „│“
- C-c < makes a pair of single guillemets: ‹│›
- C-c 6 makes a 66—99 pair: “│”.
- Pri apostrofoj kaj citiloj en Esperanto (in Esperanto)
- Which Unicode character should represent the English apostrophe? (And why the Unicode committee is very wrong)
- Getting Emacs to play nice with Hunspell and apostrophes (a thread in the <help-gnu-emacs@gnu.org> forum)