A (better) trigram tokenizer for SQLite3 FTS5 that also handles words less than 3 characters in length.


streetwriters/sqlite-better-trigram


A (better) trigram tokenizer for SQLite3 FTS5

While SQLite3 already has a built-in trigram tokenizer, it does not have any kind of word segmentation support. For example, i am a bird gets tokenized as ['i a', ' am', 'am ', 'm a', ' a ', 'a b', ' bi', 'bir', 'ird'], which isn't too bad, but in real-world use cases users usually type queries such as SELECT * FROM fts_table WHERE title MATCH 'a bird', which returns no results.
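To make the problem concrete, here is a minimal Python sketch of a plain sliding-window trigram split. It is only a model of the behavior described above, not SQLite's actual C implementation, and the function name sliding_trigrams is made up for illustration.

```python
def sliding_trigrams(text):
    """Return every contiguous 3-character window of text, spaces included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


# Windows straddle the spaces, so short words never appear as standalone tokens.
print(sliding_trigrams("i am a bird"))

# A one-character term is shorter than any window and yields no tokens at all.
print(sliding_trigrams("a"))
```

In this model, tokens such as ' am' and 'm a' mix characters from neighbouring words, and a short query term like a produces nothing to look up.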

This tokenizer fixes this by treating spaces as a word boundary. The result is that i am a bird gets tokenized as ['i', 'am', 'a', 'bir', 'ird'] and SELECT * FROM fts_table WHERE title MATCH 'a bird' correctly returns the expected results. You get all the benefits of substring matching, just with a wider range of queries.
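The word-boundary idea can be sketched in a few lines of Python. This is an approximation under simple assumptions (whitespace is the only boundary, no case folding), not the tokenizer's real code; word_trigrams and tokenize_words are hypothetical names.

```python
def word_trigrams(word):
    """Trigrams of a single word; words shorter than 3 characters are kept whole."""
    if len(word) < 3:
        return [word]
    return [word[i:i + 3] for i in range(len(word) - 2)]


def tokenize_words(text):
    """Split on whitespace first, then build trigrams inside each word."""
    tokens = []
    for word in text.split():
        tokens.extend(word_trigrams(word))
    return tokens


print(tokenize_words("i am a bird"))
print(tokenize_words("a bird"))
```

Because the query a bird is split the same way as the indexed text, every query token also exists as a token of i am a bird.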

Furthermore, the built-in trigram tokenizer treats CJK as normal characters and creates trigrams out of them. The problem is, in CJK a single Unicode character can be a whole word. better-trigram fixes this by treating each CJK character as its own token. For example: 李红:那是钢笔 gets tokenized as ['李','红',':','那','是','钢','笔'] and if there are any non-CJK words mixed in the input, they also get properly tokenized automatically.
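A rough Python sketch of the mixed CJK handling follows. The code point ranges in CJK_RANGES are an assumption chosen for illustration (the real tokenizer may classify characters differently), and is_cjk and tokenize_mixed are hypothetical names.

```python
# Assumed ranges for illustration only; the real tokenizer's table may differ.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """True if ch falls inside one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def tokenize_mixed(text):
    """Emit each CJK character as its own token; trigram everything else per word."""
    tokens = []
    word = ""

    def flush():
        # Emit the pending non-CJK word: whole if short, as trigrams otherwise.
        nonlocal word
        if word:
            if len(word) < 3:
                tokens.append(word)
            else:
                tokens.extend(word[i:i + 3] for i in range(len(word) - 2))
            word = ""

    for ch in text:
        if ch.isspace():
            flush()
        elif is_cjk(ch):
            flush()
            tokens.append(ch)
        else:
            word += ch
    flush()
    return tokens


print(tokenize_mixed("李红:那是钢笔"))
print(tokenize_mixed("hello 世界 ok"))
```

In this sketch a CJK character acts as a word boundary in the same way a space does, which is why the colon between two CJK characters ends up as a token of its own.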

Compatibility with trigram

better-trigram is 99% compatible with trigram. This means it has full UTF-8 support, handles all the same edge cases, etc. To ensure better-trigram remains compatible, it passes all the trigram tokenizer tests. Yay!

With that being said, better-trigram doesn't support LIKE & GLOB patterns. This is a limitation in FTS5 because it doesn't allow custom tokenizers to opt in to this behavior. (You could technically compile a custom version of FTS5 that enables support for this but I haven't looked into it.) Using LIKE & GLOB will fall back to full table scans (not recommended).

Getting started

Prerequisites

  • Lemon
  • Tcl

Build

First install the prerequisites:

# on macOS
brew install lemon tcl-tk

# on Ubuntu Linux
sudo apt install lemon tcl

Then build the tokenizer:

make loadable

Usage

Load the better-trigram.so or better-trigram.dylib file as a loadable SQLite extension (e.g. .load better-trigram.so).

Then specify it when creating your FTS5 virtual table:

CREATE VIRTUAL TABLE t1 USING fts5(y, tokenize='better_trigram')
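For readers using SQLite from Python rather than the sqlite3 shell, the same two steps might look like the sketch below. It assumes the extension was built as ./better-trigram.so in the current directory and that your Python build allows loading SQLite extensions; open_fts_db and search are hypothetical helper names.

```python
import sqlite3


def open_fts_db(extension_path="./better-trigram.so"):
    """Open an in-memory database, load the extension, and create the FTS5 table.

    extension_path is an assumed location; point it at your own build output.
    """
    conn = sqlite3.connect(":memory:")
    conn.enable_load_extension(True)
    try:
        conn.load_extension(extension_path)
    finally:
        # Turn extension loading back off once the tokenizer is registered.
        conn.enable_load_extension(False)
    conn.execute("CREATE VIRTUAL TABLE t1 USING fts5(y, tokenize='better_trigram')")
    return conn


def search(conn, query):
    """Run an FTS5 MATCH query against t1 and return the matching rows' text."""
    rows = conn.execute("SELECT y FROM t1 WHERE t1 MATCH ?", (query,))
    return [row[0] for row in rows]
```

After conn = open_fts_db(), inserting a row with conn.execute("INSERT INTO t1(y) VALUES (?)", ("i am a bird",)) and calling search(conn, "a bird") should return that row if the tokenizer behaves as described above.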

Options

better-trigram supports exactly the same options as trigram and copies the exact behavior of the original tokenizer when specifying invalid options. This means setting case_sensitive 1 and remove_diacritics 1 together will throw an error.

Refer to the FTS5 documentation for the Trigram tokenizer for more details on what each option does and how you can use it.
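As an illustration of how the options are passed, here is a small Python helper that builds the CREATE VIRTUAL TABLE statement. It assumes the options go inside the tokenize string the same way they do for trigram, and it reads the rule above as "the two options cannot be combined". The helper name fts5_create_sql is hypothetical, and it does not escape identifiers, so only pass trusted names.

```python
def fts5_create_sql(table, column, *, case_sensitive=False, remove_diacritics=False):
    """Build a CREATE VIRTUAL TABLE statement using the better_trigram tokenizer."""
    if case_sensitive and remove_diacritics:
        # Mirrors the combination the tokenizer is described as rejecting.
        raise ValueError("case_sensitive 1 and remove_diacritics 1 cannot be combined")
    spec = "better_trigram"
    if case_sensitive:
        spec += " case_sensitive 1"
    if remove_diacritics:
        spec += " remove_diacritics 1"
    return f"CREATE VIRTUAL TABLE {table} USING fts5({column}, tokenize='{spec}')"


print(fts5_create_sql("t1", "y"))
print(fts5_create_sql("t1", "y", case_sensitive=True))
```

Checking the combination up front gives a readable Python error instead of a failure from SQLite at table creation time.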

Performance

better-trigram is significantly faster than trigram: about 1.6x in the benchmark below, which tokenizes a text of 289,091 characters.

Text length: 289091
clk: ~4.09 GHz
cpu: AMD Ryzen 5 PRO 5650U with Radeon Graphics
runtime: bun 1.1.31 (x64-linux)

benchmark              avg (min … max) p75   p99    (min … top 1%)
-------------------------------------- -------------------------------
better-trigram            6.37 ms/iter   6.86 ms   7.31 ms ▂▅▃▂▂▃▄▇█▄▂
trigram                  10.21 ms/iter  11.23 ms  12.75 ms █▄▄▄▃▅▅▇▄▂▂

summary
  better-trigram
   1.6x faster than trigram

To reproduce on your own machine, run bun bench.ts.

Contributing

All kinds of PRs are welcome, of course. Just make sure all the tests pass. You can run the tests like this:

make test

License

2024-10-21

The author disclaims copyright to this source code. In place of a legal notice, here is a blessing:

    May you do good and not evil.
    May you find forgiveness for yourself and forgive others.
    May you share freely, never taking more than you give.
