streetwriters/sqlite-better-trigram
A (better) trigram tokenizer for SQLite3 FTS5
While SQLite3 already has a built-in trigram tokenizer, it has no word segmentation support. For example, `i am a bird` gets tokenized as `['i a', ' am', 'am ', 'm a', ' a ', 'a b', ' bi', 'bir', 'ird']`, which isn't too bad, but in real-world use users usually type queries such as `SELECT * FROM fts_table WHERE title MATCH 'a bird'`, which returns no results.
This tokenizer fixes this by treating spaces as word boundaries. The result is that `i am a bird` gets tokenized as `['i', 'am', 'a', 'bir', 'ird']` and `SELECT * FROM fts_table WHERE title MATCH 'a bird'` correctly returns the expected results. You keep all the benefits of substring matching while supporting a wider range of queries.
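The word-boundary rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this README, not the extension's actual C implementation: the function name `word_trigrams` is hypothetical, and details such as case folding are left out.

```python
def word_trigrams(text: str) -> list[str]:
    """Split on whitespace, then emit trigrams within each word.

    Words shorter than three characters are kept whole, so short
    query terms such as 'a' or 'am' still produce a token.
    """
    tokens = []
    for word in text.split():
        if len(word) < 3:
            tokens.append(word)
        else:
            # Sliding window of width 3 that never crosses a word boundary.
            tokens.extend(word[i:i + 3] for i in range(len(word) - 2))
    return tokens


print(word_trigrams("i am a bird"))  # ['i', 'am', 'a', 'bir', 'ird']
```

Because the query `a bird` tokenizes to `['a', 'bir', 'ird']` under the same rule, every query token also appears among the document's tokens, which is why the match succeeds.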
Furthermore, the built-in `trigram` tokenizer treats CJK as normal characters and creates trigrams out of them. The problem is that in CJK a single Unicode character can be a whole word. `better-trigram` fixes this by treating each CJK character as its own token. For example, `李红:那是钢笔` gets tokenized as `['李','红',':','那','是','钢','笔']`, and if there are any non-CJK words mixed into the input, they also get properly tokenized automatically.
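A rough Python sketch of the CJK rule follows. It assumes, purely for illustration, that "CJK" means the CJK Unified Ideographs block (U+4E00 to U+9FFF); the ranges the extension actually checks may differ. The names `is_cjk` and `tokenize` are hypothetical.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: only the CJK Unified Ideographs block counts as CJK here.
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text: str) -> list[str]:
    """Emit each CJK character as its own token; trigram everything else per word."""
    tokens = []
    word = ""

    def flush() -> None:
        # Emit the pending non-CJK word: whole if short, otherwise as trigrams.
        nonlocal word
        if not word:
            return
        if len(word) < 3:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 3] for i in range(len(word) - 2))
        word = ""

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)
        elif ch.isspace():
            flush()
        else:
            word += ch
    flush()
    return tokens


print(tokenize("李红:那是钢笔"))  # ['李', '红', ':', '那', '是', '钢', '笔']
print(tokenize("我 love 猫"))      # ['我', 'lov', 'ove', '猫']
```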
`better-trigram` is 99% compatible with `trigram`. This means it has full UTF-8 support, handles all the same edge cases, and so on. To ensure `better-trigram` remains compatible, it passes all the `trigram` tokenizer tests. Yay!
With that being said, `better-trigram` doesn't support `LIKE` & `GLOB` patterns. This is a limitation in FTS5, which doesn't allow custom tokenizers to opt in to this behavior. (You *could* technically compile a custom version of FTS5 that enables support for this, but I haven't looked into it.) Using `LIKE` & `GLOB` will fall back to full table scans (not recommended).
- Lemon
- Tcl
First install the prerequisites:
    # on macOS
    brew install lemon tcl-tk

    # on Ubuntu Linux
    sudo apt install lemon tcl
Then build the tokenizer:
make loadable
Load the `better-trigram.so` or `better-trigram.dylib` file as a loadable SQLite extension (e.g. `.load better-trigram.so`).
Then specify it when creating your FTS5 virtual table:
CREATE VIRTUAL TABLE t1 USING fts5(y, tokenize='better_trigram')
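The same two steps can be driven from Python's standard `sqlite3` module. This is a minimal sketch under some assumptions: your Python build allows extension loading, its SQLite includes FTS5, and the built library sits at the path you pass in (the `./better-trigram.so` path in the usage comment is only an example). The helper names `open_fts` and `search` are hypothetical.

```python
import sqlite3
from typing import Optional


def open_fts(tokenizer: str = "better_trigram",
             extension_path: Optional[str] = None) -> sqlite3.Connection:
    """Open an in-memory database and create the FTS5 table from this README."""
    conn = sqlite3.connect(":memory:")
    if extension_path is not None:
        # Extension loading is disabled by default; enable it only for the load.
        conn.enable_load_extension(True)
        conn.load_extension(extension_path)
        conn.enable_load_extension(False)
    conn.execute(
        f"CREATE VIRTUAL TABLE t1 USING fts5(y, tokenize='{tokenizer}')"
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> list:
    """Run a full-text MATCH query and return the matching rows' text."""
    rows = conn.execute("SELECT y FROM t1 WHERE t1 MATCH ?", (query,))
    return [row[0] for row in rows]


# Usage, assuming the extension was built in the current directory:
# conn = open_fts(extension_path="./better-trigram.so")
# conn.execute("INSERT INTO t1(y) VALUES ('i am a bird')")
# print(search(conn, "a bird"))
```

Passing a different `tokenizer` value, such as the built-in `trigram`, makes it easy to compare results for the same query side by side.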
`better-trigram` supports exactly the same options as `trigram` and copies the exact behavior of the original tokenizer when invalid options are specified. This means setting both `case_sensitive 1` and `remove_diacritics 1` will throw an error.
Refer to the FTS5 documentation for the Trigram tokenizer for more details on what each option does and how you can use it.
`better-trigram` is significantly faster than `trigram`. Here are the benchmarks:
    Text length: 289091
    clk: ~4.09 GHz
    cpu: AMD Ryzen 5 PRO 5650U with Radeon Graphics
    runtime: bun 1.1.31 (x64-linux)

    benchmark          avg (min … max)   p75        p99        (min … top 1%)
    -------------------------------------- -------------------------------
    better-trigram     6.37 ms/iter      6.86 ms    7.31 ms    ▂▅▃▂▂▃▄▇█▄▂
    trigram            10.21 ms/iter     11.23 ms   12.75 ms   █▄▄▄▃▅▅▇▄▂▂

    summary
      better-trigram
        1.6x faster than trigram

To reproduce on your own machine, run `bun bench.ts`.
All kinds of PRs are welcome, of course. Just make sure all the tests pass. You can run the tests like this:
    make test

2024-10-21

The author disclaims copyright to this source code. In place of a legal notice, here is a blessing:

- May you do good and not evil.
- May you find forgiveness for yourself and forgive others.
- May you share freely, never taking more than you give.