fix: add length filter for tantivy token#8310


Merged
hengfeiyang merged 6 commits into main from feat/tantivy-tokens on Sep 6, 2025

Conversation

@hengfeiyang (Contributor) commented on Sep 5, 2025 · edited by github-actions bot

User description

Add two new environment variables:

ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2
ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64

In the previous version we only removed tokens longer than 64 characters; we didn't remove the small ones.

Now we remove both tiny tokens and huge tokens.
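As a rough illustration of how the two variables feed the defaults (min = 2, max = 64) described above, here is a hypothetical, self-contained sketch; the helper name is made up, and OpenObserve's real parsing in config.rs differs:

use std::env;

// Hypothetical helper, not the actual OpenObserve config code.
fn inverted_index_token_limits() -> (usize, usize) {
    let min = env::var("ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(2); // default from this PR
    let max = env::var("ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(64); // default from this PR
    (min, max.max(min)) // keep min <= max so the filters cannot drop everything
}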

Why we got errors like this:

thread 'job_runtime' panicked at /index.crates.io-1949cf8c6b5b557f/tantivy-stacker-0.3.0/src/memory_arena.rs:205:9:

Because Tantivy uses int32 for memory addresses, it can't build a data page larger than 4 GB (2^32 bytes). OpenObserve generates one Tantivy segment file per parquet file, so a very large parquet file produces a huge number of tokens; eventually the segment exceeds the limit and panics.

This PR improves token generation; it now automatically removes:

  • single-character tokens (anything shorter than 2 characters)
  • huge tokens (anything longer than 64 characters)

With these changes we can successfully generate a 5 GB parquet file together with its Tantivy index file.
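To show the intended effect end to end, here is a minimal sketch of an analyzer wired with both filters. It assumes tantivy's TextAnalyzer builder API; the import path for the PR's RemoveShortFilter is guessed from the file walkthrough below, and the sample input is made up:

use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer};
// Path assumed from src/config/src/utils/tantivy/tokenizer/remove_short.rs:
use config::utils::tantivy::tokenizer::remove_short::RemoveShortFilter;

fn main() {
    let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveShortFilter::limit(2)) // drop tokens shorter than 2 chars
        .filter(RemoveLongFilter::limit(64)) // tantivy's built-in long-token filter
        .build();

    let mut stream = analyzer.token_stream("a tiny id x vs a normal token");
    while stream.advance() {
        // prints: tiny, id, vs, normal, token -- the single chars "a"/"x" are gone
        println!("{}", stream.token().text);
    }
}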


PR Type

Enhancement, Bug fix


Description

  • Add min/max token length configs

  • Implement short-token removal filter

  • Apply token limits in analyzers

  • Fix compactor file naming collisions


Diagram Walkthrough

flowchart LR
  CFG["Config: add min/max token length + validation"] -- "used by" --> TOK["Tokenizer builder: apply RemoveShort + RemoveLong"]
  TOK -- "new filter" --> RSF["RemoveShortFilter"]
  O2["O2 Tokenizer tests"] -- "adjust thresholds" --> TOK
  ID["ider: generate_file_name()"] -- "use in" --> JOB["job/files: storage file naming"]
  ID -- "use in" --> CMP["compactor: merged file naming"]

File Walkthrough

Relevant files

Enhancement (4 files)
  • config.rs — Add token length configs and validation hook (+36/-6)
  • ider.rs — Introduce unique file name generator (+7/-0)
  • mod.rs — Wire min/max token filters into analyzers (+18/-7)
  • remove_short.rs — Implement RemoveShortFilter and tests (+120/-0)

Tests (1 file)
  • o2_tokenizer.rs — Align base64 threshold with max token; test update (+3/-3)

Bug fix (2 files)
  • mod.rs — Use new filename generator; remove ad-hoc suffix (+2/-3)
  • merge.rs — Use new filename generator for merged outputs (+2/-2)

@github-actions (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Bounds

The max token length is derived using max(config, 64). This prevents lowering the limit below 64 via config, which may be intended but contradicts the new env var description. Confirm desired behavior or switch to clamping (min/max) semantics.

pub fn o2_tokenizer_build() -> TextAnalyzer {
    let cfg = get_config();
    let min_token_length =
        std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
    let max_token_length =
        std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
    if cfg.common.inverted_index_camel_case_tokenizer_disabled {
        tantivy::tokenizer::TextAnalyzer::builder(SimpleTokenizer::default())
            .filter(RemoveShortFilter::limit(min_token_length))
            .filter(tantivy::tokenizer::RemoveLongFilter::limit(
                max_token_length,
            ))
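To make the max-versus-clamp point concrete, a tiny standalone illustration (values arbitrary):

fn main() {
    // max() silently raises a configured value of 16 back up to 64...
    assert_eq!(std::cmp::max(16usize, 64), 64);
    // ...while clamp() honors an in-range configured value.
    assert_eq!(16usize.clamp(2, 64), 16);
    println!("max() cannot lower the limit below 64; clamp() can");
}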
Position Gaps

Removing short tokens preserves original token positions, resulting in non-sequential positions (e.g., first token can have position 1). Validate that downstream components expect gaps and that scoring/phrase queries in Tantivy behave as desired.

let mut a = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(RemoveShortFilter::limit(2))
    .build();
let mut token_stream = a.token_stream(text);
let mut tokens: Vec<Token> = vec![];
let mut add_token = |token: &Token| {
    tokens.push(token.clone());
};
token_stream.process(&mut add_token);
tokens
}

#[test]
fn test_remove_short() {
    let tokens = token_stream_helper("i am a hello tantivy, happy searching!");
    assert_eq!(tokens.len(), 5);
    assert_token(&tokens[0], 1, "am", 2, 4);
    assert_token(&tokens[1], 3, "hello", 7, 12);
    assert_token(&tokens[2], 4, "tantivy", 13, 20);
    assert_token(&tokens[3], 5, "happy", 22, 27);
    assert_token(&tokens[4], 6, "searching", 28, 37);
Base64 Threshold

MIN_BASE64_LENGTH tied to MAX_TOKEN_LENGTH+1 drastically lowers the prior 1024 threshold; verify this doesn’t reintroduce noisy base64-like tokens and aligns with performance goals.

fn looks_like_base64(s: &str) -> bool {
    // Base64 characteristics:
    // 1. Only contains [A-Za-z0-9], the tokenizer will remove all non-alphanumeric characters
    // 2. Should be reasonably long (to avoid false positives)
    const MIN_BASE64_LENGTH: usize = super::MAX_TOKEN_LENGTH + 1; // Adjust this threshold as needed
    if s.len() < MIN_BASE64_LENGTH {
        return false;

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Enforce min-max consistency

Ensure min_token_length never exceeds max_token_length; otherwise all tokens could
be dropped. After clamping individual values, enforce min <= max by adjusting one
side if needed.

src/config/src/utils/tantivy/tokenizer/mod.rs [30-33]

-let min_token_length =
-    std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let mut min_token_length = cfg
+    .limit
+    .inverted_index_min_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+let mut max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+if min_token_length > max_token_length {
+    min_token_length = max_token_length;
+}
Suggestion importance[1-10]: 8


Why: Ensuring min_token_length <= max_token_length prevents pathological configurations that could drop all tokens. The improved code accurately derives from the new hunk and meaningfully hardens behavior.

Medium
Clamp token length bounds

Using max here ignores smaller configured limits and can allow overly long tokens.
Clamp the configured value between sensible min and max bounds, and use it directly
without inflating beyond MAX_TOKEN_LENGTH.

src/config/src/utils/tantivy/tokenizer/mod.rs [33]

-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
Suggestion importance[1-10]: 7


Why: The existing code uses max with MAX_TOKEN_LENGTH, which can exceed the intended upper bound; clamping to [MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH] is a reasonable fix. It improves correctness by honoring configured limits without allowing overly long tokens, though it's not critical.

Medium
General
Count Unicode characters

token.text.len() measures bytes, not characters, so multibyte UTF-8 tokens may be
misclassified. Use chars().count() to compare character length consistently with the
filter's description.

src/config/src/utils/tantivy/tokenizer/remove_short.rs [21-23]

 impl<T> RemoveShortFilterStream<T> {
     fn predicate(&self, token: &Token) -> bool {
-        token.text.len() >= self.token_length_limit
+        token.text.chars().count() >= self.token_length_limit
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion aligns with the filter's description about UTF-8 but trades performance for semantic accuracy; depending on intent (bytes vs chars), this may or may not be desirable. It's a valid minor improvement with limited impact.

Low
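A tiny self-contained illustration of the byte-versus-character distinction behind this suggestion (sample string arbitrary):

fn main() {
    let token = "héllo"; // 'é' takes 2 bytes in UTF-8
    assert_eq!(token.len(), 6);           // byte length
    assert_eq!(token.chars().count(), 5); // character count
    // With a minimum length of 6, len() would keep "héllo" but
    // chars().count() would drop it.
}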

@greptile-apps[bot] (Contributor) left a comment
Greptile Summary

This PR introduces configurable token length filtering for Tantivy's inverted index to resolve critical memory issues that were causing panics when processing large parquet files. The core problem was that Tantivy uses int32 for memory addressing, limiting data pages to 4GB, but OpenObserve generates one segment file per parquet file. When processing very large parquet files, the excessive number of tokens (including single characters and very long strings) would cause Tantivy segments to exceed this 4GB limit, resulting in panics at memory_arena.rs:205:9.

The solution implements a two-pronged token filtering approach:

New Environment Variables:

  • ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2 - Filters out tokens shorter than 2 characters
  • ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64 - Filters out tokens longer than 64 characters (previously only long token filtering existed)

Key Changes:

  1. Token Filtering: A new RemoveShortFilter is added in src/config/src/utils/tantivy/tokenizer/remove_short.rs that complements the existing RemoveLongFilter. Both filters are now applied in the tokenizer pipeline (a sketch of such a filter appears after this list).

  2. Configuration: The config system now validates and sets sensible defaults for token length limits, ensuring minimum=2 and maximum=64 when values are unset.

  3. Tokenizer Integration: The o2_tokenizer_build() function in mod.rs now applies both min and max token length filters to both the SimpleTokenizer and O2Tokenizer paths.

  4. Base64 Detection Threshold: The O2Tokenizer's base64 detection logic is updated to use the configurable max token length (MAX_TOKEN_LENGTH + 1 = 65) instead of a hardcoded 1024, creating more consistent behavior.

  5. File Name Generation: Several files now use ider::generate_file_name() instead of ider::generate() to provide enhanced uniqueness with a 4-character hex suffix, supporting the improved token processing workflow.
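As referenced in item 1, here is a sketch of what such a short-token filter can look like, modeled on tantivy's own RemoveLongFilter and its 0.22-style TokenFilter API; the PR's actual remove_short.rs may differ in details:

use tantivy::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};

// Drops tokens *shorter* than a limit, mirroring tantivy's RemoveLongFilter.
#[derive(Clone)]
pub struct RemoveShortFilter {
    length_limit: usize,
}

impl RemoveShortFilter {
    // Keeps tokens whose text is at least `length_limit` bytes long.
    pub fn limit(length_limit: usize) -> Self {
        Self { length_limit }
    }
}

impl TokenFilter for RemoveShortFilter {
    type Tokenizer<T: Tokenizer> = RemoveShortFilterWrapper<T>;

    fn transform<T: Tokenizer>(self, tokenizer: T) -> Self::Tokenizer<T> {
        RemoveShortFilterWrapper {
            length_limit: self.length_limit,
            inner: tokenizer,
        }
    }
}

#[derive(Clone)]
pub struct RemoveShortFilterWrapper<T: Tokenizer> {
    length_limit: usize,
    inner: T,
}

impl<T: Tokenizer> Tokenizer for RemoveShortFilterWrapper<T> {
    type TokenStream<'a> = RemoveShortFilterStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        RemoveShortFilterStream {
            token_length_limit: self.length_limit,
            tail: self.inner.token_stream(text),
        }
    }
}

pub struct RemoveShortFilterStream<T> {
    token_length_limit: usize,
    tail: T,
}

impl<T> RemoveShortFilterStream<T> {
    // Same predicate as shown in the review comments above (byte length).
    fn predicate(&self, token: &Token) -> bool {
        token.text.len() >= self.token_length_limit
    }
}

impl<T: TokenStream> TokenStream for RemoveShortFilterStream<T> {
    fn advance(&mut self) -> bool {
        // Skip tokens that fail the predicate; positions of kept tokens are
        // preserved, which is why the test above shows position gaps.
        while self.tail.advance() {
            if self.predicate(self.tail.token()) {
                return true;
            }
        }
        false
    }

    fn token(&self) -> &Token {
        self.tail.token()
    }

    fn token_mut(&mut self) -> &mut Token {
        self.tail.token_mut()
    }
}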

With these changes, OpenObserve can successfully process 5GB parquet files without hitting Tantivy's memory limits, significantly improving the system's ability to handle large datasets while maintaining search functionality.

Confidence score: 3/5

  • This PR addresses a critical production issue but has some implementation inconsistencies that need attention
  • Score reflects the importance of the fix but concerns about inconsistent tokenization behavior between different code paths
  • Pay close attention to the tokenizer module files, especially the inconsistency between the o2_tokenizer_build and o2_collect_tokens functions

7 files reviewed, 5 comments


@hengfeiyang merged commit 890a8d4 into main on Sep 6, 2025
28 of 30 checks passed
@hengfeiyang deleted the feat/tantivy-tokens branch on September 6, 2025 at 14:28

Reviewers

@uddhavdave approved these changes

@haohuaijin approved these changes

+1 more reviewer

@greptile-apps[bot] left review comments

