Shared ispell dictionary (stored in shared segment, used by multiple connections)
postgrespro/shared_ispell
This PostgreSQL extension provides a shared ispell dictionary, i.e. a dictionary that's stored in a shared memory segment. With the traditional ispell implementation, each session initializes and stores the dictionary on its own, which wastes a lot of CPU and RAM.
This extension allocates an area in the shared segment (you have to choose the size in advance) and then loads the dictionary into it when it's used for the first time.
If you need just snowball-type dictionaries, this extension is not really interesting for you. But if you really need an ispell dictionary, this may save you a lot of resources.
Before building and installing shared_ispell, ensure the following:
- PostgreSQL version is 9.6 or later.
Installing the extension is quite simple. All you need to do is this:

    $ git clone git@github.com:postgrespro/shared_ispell.git
    $ cd shared_ispell
    $ make USE_PGXS=1
    $ make USE_PGXS=1 install

and then (after connecting to the database)

    db=# CREATE EXTENSION shared_ispell;
Important: Don't forget to set the PG_CONFIG variable in case you want to test shared_ispell on a custom build of PostgreSQL. Read more here.
Now the functions are created, but you still need to load the shared module. This needs to be done from postgresql.conf, as the module needs to allocate space in the shared memory segment. So add this to the config file (or update the current values)

    # libraries to load
    shared_preload_libraries = 'shared_ispell'

    # config of the shared memory
    shared_ispell.max_size = 32MB
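After restarting the server, one way to check that the settings took effect is to query them from any session. This is only a sketch; it assumes the two settings above were added to postgresql.conf as shown.

```sql
-- The list of preloaded libraries should include shared_ispell.
SHOW shared_preload_libraries;

-- The configured size of the shared segment.
SHOW shared_ispell.max_size;
```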
Yes, there's a single GUC variable that defines the maximum size of the shared segment. This is a hard limit; the shared segment is not extensible, so you need to set it large enough that all the dictionaries fit into it, but not so large that much memory is wasted.
To find out how much memory you actually need, use a large value (e.g. 200MB) and load all the dictionaries you want to use. Then use the shared_ispell_mem_used() function to find out how much memory was actually used (and set the max_size GUC variable accordingly).
Don't set it exactly to that value. Leave some free space, so that you can reload the dictionaries without changing the max_size GUC limit (which requires a restart of the DB). Something like 512kB should be just fine.
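The sizing procedure above could be sketched as the following queries, run after all dictionaries have been loaded. The sketch assumes that both functions return byte counts as integers; the column aliases and the rounding up to whole kilobytes are illustrative choices, not part of the extension.

```sql
-- Memory currently used and still free in the shared segment.
SELECT shared_ispell_mem_used()      AS used_bytes,
       shared_ispell_mem_available() AS free_bytes;

-- Used memory plus 512 kB of headroom, rounded up to whole kB,
-- as a candidate value for shared_ispell.max_size.
SELECT ceil((shared_ispell_mem_used() + 512 * 1024) / 1024.0) || 'kB'
       AS suggested_max_size;
```

The resulting value still has to be written to postgresql.conf by hand, and the server restarted.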
The shared segment can contain several dictionaries at the same time; the amount of memory is the only limit. There's no limit on the number of dictionaries, words, etc. Just the max_size GUC variable.
Technically, the extension defines a 'shared_ispell' template that you may use to define custom dictionaries. E.g. you may do this

    CREATE TEXT SEARCH DICTIONARY czech_shared (
        TEMPLATE = shared_ispell,
        DictFile = czech,
        AffFile = czech,
        StopWords = czech
    );

    CREATE TEXT SEARCH CONFIGURATION public.czech_shared
        ( COPY = pg_catalog.simple );

    ALTER TEXT SEARCH CONFIGURATION czech_shared
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                          word, hword, hword_part
        WITH czech_shared;
and then do the usual stuff, e.g.
    db=# SELECT ts_lexize('czech_shared', 'automobile');
or whatever you want.
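For instance, the czech_shared configuration from the example can be passed to the standard full-text search functions. This is a sketch with made-up sample text; the actual lexemes depend on the dictionary files you installed.

```sql
-- Parse a document using the configuration backed by the shared dictionary.
SELECT to_tsvector('public.czech_shared', 'automobil a automobily');

-- Build a query with the same configuration.
SELECT to_tsquery('public.czech_shared', 'automobil');
```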
The extension provides five management functions that allow you to manage and get info about the preloaded dictionaries. The first two functions

    shared_ispell_mem_used()
    shared_ispell_mem_available()

allow you to get info about the shared segment (used and free memory), e.g. to properly size the segment (max_size). Then there are functions that return the list of dictionaries / stop lists loaded in the shared segment

    shared_ispell_dicts()
    shared_ispell_stoplists()

e.g. like this

    db=# SELECT * FROM shared_ispell_dicts();
     dict_name | affix_name | words | affixes |  bytes
    -----------+------------+-------+---------+----------
     bulgarian | bulgarian  | 79267 |      12 |  7622128
     czech     | czech      | 96351 |    2544 | 12715000
    (2 rows)

    db=# SELECT * FROM shared_ispell_stoplists();
     stop_name | words | bytes
    -----------+-------+-------
     czech     |   259 |  4552
    (1 row)
The last function allows you to reset the dictionary (e.g. so that you can reload the updated files from disk). The sessions that already use the dictionaries will be forced to reinitialize them (the first one will rebuild and copy them in the shared segment, the other ones will use this prepared data).

    db=# SELECT shared_ispell_reset();
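A reload after updating the dictionary files on disk might look like the sketch below. It reuses the czech_shared dictionary from the earlier example and assumes, as described above, that the first lookup after a reset rebuilds the dictionary in the shared segment.

```sql
-- Discard the dictionaries currently held in the shared segment.
SELECT shared_ispell_reset();

-- The first use after the reset rebuilds the dictionary from the files on disk.
SELECT ts_lexize('czech_shared', 'automobile');

-- The dictionary should be listed again, with sizes reflecting the updated files.
SELECT * FROM shared_ispell_dicts();
```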
That's all for now ...
The original version of this module is located in Tomas Vondra's GitHub. That version does not handle affixes that require full regular expressions (regex_t, implemented in regex.h).
This version of the module can handle affixes with full regular expressions. To do so, the module loads and stores the affix files in each session. The affix list is tiny and takes little time and memory to parse. Actually this is Tomas' idea, but there is no related code in the GitHub repository.
Tomas Vondra (GitHub)