fix: add length filter for tantivy token#8310
Conversation
Greptile Summary
This PR introduces configurable token length filtering for Tantivy's inverted index to resolve critical memory issues that were causing panics when processing large parquet files. The core problem was that Tantivy uses int32 for memory addressing, limiting data pages to 4GB, but OpenObserve generates one segment file per parquet file. When processing very large parquet files, the excessive number of tokens (including single characters and very long strings) would cause Tantivy segments to exceed this 4GB limit, resulting in panics at memory_arena.rs:205:9.
The solution implements a two-pronged token filtering approach:
New Environment Variables:
- ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2 - filters out tokens shorter than 2 characters
- ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64 - filters out tokens longer than 64 characters (previously only long-token filtering existed)
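As a minimal sketch (this is not OpenObserve's actual config code, and the helper name is made up), these variables could be read and defaulted like this:

```rust
use std::env;

/// Illustrative helper: read a token length limit from an environment
/// variable, falling back to a default when the variable is unset,
/// unparseable, or zero.
fn read_token_limit(var: &str, default: usize) -> usize {
    env::var(var)
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .filter(|&v| v > 0)
        .unwrap_or(default)
}

fn main() {
    // Defaults of 2 and 64 match the values described in this PR.
    let min_len = read_token_limit("ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH", 2);
    let max_len = read_token_limit("ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH", 64);
    assert!(min_len <= max_len, "min token length must not exceed max");
    println!("token length limits: {min_len}..={max_len}");
}
```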
Key Changes:
- Token Filtering: A new RemoveShortFilter is added in src/config/src/utils/tantivy/tokenizer/remove_short.rs that complements the existing RemoveLongFilter. Both filters are now applied in the tokenizer pipeline (see the sketch after this list).
- Configuration: The config system now validates and sets sensible defaults for token length limits, ensuring minimum=2 and maximum=64 when values are unset.
- Tokenizer Integration: The o2_tokenizer_build() function in mod.rs now applies both min and max token length filters to both the SimpleTokenizer and O2Tokenizer paths.
- Base64 Detection Threshold: The O2Tokenizer's base64 detection logic is updated to use the configurable max token length (MAX_TOKEN_LENGTH + 1 = 65) instead of a hardcoded 1024, creating more consistent behavior.
- File Name Generation: Several files now use ider::generate_file_name() instead of ider::generate() to provide enhanced uniqueness with a 4-character hex suffix, supporting the improved token processing workflow.
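The PR's actual filter code isn't reproduced in this summary; the following is only a conceptual sketch of the length rule the two filters enforce together (the function name is hypothetical, and it is assumed here that characters rather than bytes are counted):

```rust
/// Conceptual sketch of the combined short/long token rule: keep a token
/// only if its character count falls within [min_len, max_len].
fn keep_token(token: &str, min_len: usize, max_len: usize) -> bool {
    let n = token.chars().count();
    n >= min_len && n <= max_len
}

fn main() {
    let long = "x".repeat(100);
    let tokens = vec!["a", "error", long.as_str()];
    // With min=2 and max=64, "a" and the 100-char token are dropped.
    let kept: Vec<&str> = tokens.into_iter().filter(|t| keep_token(t, 2, 64)).collect();
    assert_eq!(kept, vec!["error"]);
    println!("kept tokens: {kept:?}");
}
```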
With these changes, OpenObserve can successfully process 5GB parquet files without hitting Tantivy's memory limits, significantly improving the system's ability to handle large datasets while maintaining search functionality.
Confidence score: 3/5
- This PR addresses a critical production issue but has some implementation inconsistencies that need attention
- Score reflects the importance of the fix but also concerns about inconsistent tokenization behavior between different code paths
- Pay close attention to the tokenizer module files, especially the inconsistency between the o2_tokenizer_build and o2_collect_tokens functions
7 files reviewed, 5 comments
Merged 890a8d4 into main
User description
Add two new ENV:
- ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH (default 2)
- ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH (default 64)
The previous version only removed tokens longer than 64 characters and didn't remove small tokens. Now we support removing both tiny tokens and huge tokens.
Why we got errors like this:
Tantivy uses int32 for memory addressing, so it cannot generate a data page larger than 4GB, and OpenObserve generates only one segment file per parquet file. This causes a problem: a very large parquet file generates a huge number of tokens, and eventually the Tantivy segment exceeds the limit and panics.
This PR improves token generation and automatically removes:
- tokens shorter than 2 chars
- tokens longer than 64 chars
With these changes we can successfully generate the Tantivy file for a 5GB parquet file.
PR Type
Enhancement, Bug fix
Description
- Add min/max token length configs
- Implement short-token removal filter
- Apply token limits in analyzers (see the analyzer sketch below)
- Fix compactor file naming collisions
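A minimal sketch of how such limits are typically chained onto a tantivy analyzer, assuming tantivy's TextAnalyzer builder API: tantivy's built-in RemoveLongFilter covers the upper bound, while the PR's custom RemoveShortFilter (not reproduced here) would be chained the same way as an additional filter step.

```rust
use tantivy::tokenizer::{LowerCaser, RemoveLongFilter, SimpleTokenizer, TextAnalyzer};

fn main() {
    // Sketch only: RemoveLongFilter drops tokens over the 64 limit; the
    // lower bound in this PR comes from the new custom RemoveShortFilter,
    // which would be added here as another .filter(...) call.
    let _analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveLongFilter::limit(64))
        .filter(LowerCaser)
        .build();
}
```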
Diagram Walkthrough
File Walkthrough
4 files:
- Add token length configs and validation hook
- Introduce unique file name generator
- Wire min/max token filters into analyzers
- Implement RemoveShortFilter and tests
1 file:
- Align base64 threshold with max token; test update
2 files:
- Use new filename generator; remove ad-hoc suffix
- Use new filename generator for merged outputs
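The actual ider::generate_file_name() implementation is not included in this summary; the sketch below only illustrates the idea of appending a 4-character hex suffix for extra uniqueness (function name hypothetical, assuming the rand 0.8 crate):

```rust
use rand::Rng;

/// Hypothetical sketch: append a 4-character hex suffix to a base id to
/// reduce the chance of file name collisions when two files are generated
/// at the same instant.
fn with_hex_suffix(base_id: &str) -> String {
    let suffix: u16 = rand::thread_rng().gen();
    format!("{base_id}_{suffix:04x}")
}

fn main() {
    println!("{}", with_hex_suffix("7239012345678"));
}
```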