fix: add length filter for tantivy token#8310


Merged
hengfeiyang merged 6 commits into main from feat/tantivy-tokens on Sep 6, 2025

Conversation

@hengfeiyang (Contributor) commented on Sep 5, 2025 · edited by github-actions bot

User description

Add two new environment variables:

ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2
ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64

In the previous version we only removed tokens longer than 64 characters; we didn't remove the small ones.

Now we remove both tiny tokens and huge tokens.
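As a rough illustration of how the two variables feed the defaults (min = 2, max = 64) described above, here is a hypothetical, self-contained sketch; the helper name is made up, and OpenObserve's real parsing in config.rs differs:

use std::env;

// Hypothetical helper, not the actual OpenObserve config code.
fn inverted_index_token_limits() -> (usize, usize) {
    let min = env::var("ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(2); // default from this PR
    let max = env::var("ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(64); // default from this PR
    (min, max.max(min)) // keep min <= max so the filters cannot drop everything
}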

Why we got errors like this:

thread 'job_runtime' panicked at /index.crates.io-1949cf8c6b5b557f/tantivy-stacker-0.3.0/src/memory_arena.rs:205:9:

Because Tantivy uses int32 for memory addresses, it can't build a data page larger than 4 GB (2^32 bytes). OpenObserve generates one Tantivy segment file per parquet file, so a very large parquet file produces a huge number of tokens; eventually the segment exceeds the limit and panics.

This PR improves token generation; it now automatically removes:

  • single-character tokens (anything shorter than 2 characters)
  • huge tokens (anything longer than 64 characters)

With these changes we can successfully generate a 5 GB parquet file together with its Tantivy index file.
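To show the intended effect end to end, here is a minimal sketch of an analyzer wired with both filters. It assumes tantivy's TextAnalyzer builder API; the import path for the PR's RemoveShortFilter is guessed from the file walkthrough below, and the sample input is made up:

use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer};
// Path assumed from src/config/src/utils/tantivy/tokenizer/remove_short.rs:
use config::utils::tantivy::tokenizer::remove_short::RemoveShortFilter;

fn main() {
    let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveShortFilter::limit(2)) // drop tokens shorter than 2 chars
        .filter(RemoveLongFilter::limit(64)) // tantivy's built-in long-token filter
        .build();

    let mut stream = analyzer.token_stream("a tiny id x vs a normal token");
    while stream.advance() {
        // prints: tiny, id, vs, normal, token -- the single chars "a"/"x" are gone
        println!("{}", stream.token().text);
    }
}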


PR Type

Enhancement, Bug fix


Description

  • Add min/max token length configs

  • Implement short-token removal filter

  • Apply token limits in analyzers

  • Fix compactor file naming collisions


Diagram Walkthrough

flowchart LR
  CFG["Config: add min/max token length + validation"] -- "used by" --> TOK["Tokenizer builder: apply RemoveShort + RemoveLong"]
  TOK -- "new filter" --> RSF["RemoveShortFilter"]
  O2["O2 Tokenizer tests"] -- "adjust thresholds" --> TOK
  ID["ider: generate_file_name()"] -- "use in" --> JOB["job/files: storage file naming"]
  ID -- "use in" --> CMP["compactor: merged file naming"]

File Walkthrough

Relevant files

Enhancement (4 files)
  • config.rs — Add token length configs and validation hook (+36/-6)
  • ider.rs — Introduce unique file name generator (+7/-0)
  • mod.rs — Wire min/max token filters into analyzers (+18/-7)
  • remove_short.rs — Implement RemoveShortFilter and tests (+120/-0)

Tests (1 file)
  • o2_tokenizer.rs — Align base64 threshold with max token; test update (+3/-3)

Bug fix (2 files)
  • mod.rs — Use new filename generator; remove ad-hoc suffix (+2/-3)
  • merge.rs — Use new filename generator for merged outputs (+2/-2)

@github-actions (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Bounds

The max token length is derived using max(config, 64). This prevents lowering the limit below 64 via config, which may be intended but contradicts the new env var description. Confirm desired behavior or switch to clamping (min/max) semantics.

pub fn o2_tokenizer_build() -> TextAnalyzer {
    let cfg = get_config();
    let min_token_length =
        std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
    let max_token_length =
        std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
    if cfg.common.inverted_index_camel_case_tokenizer_disabled {
        tantivy::tokenizer::TextAnalyzer::builder(SimpleTokenizer::default())
            .filter(RemoveShortFilter::limit(min_token_length))
            .filter(tantivy::tokenizer::RemoveLongFilter::limit(
                max_token_length,
            ))
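To make the max-versus-clamp point concrete, a tiny standalone illustration (values arbitrary):

fn main() {
    // max() silently raises a configured value of 16 back up to 64...
    assert_eq!(std::cmp::max(16usize, 64), 64);
    // ...while clamp() honors an in-range configured value.
    assert_eq!(16usize.clamp(2, 64), 16);
    println!("max() cannot lower the limit below 64; clamp() can");
}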
Position Gaps

Removing short tokens preserves original token positions, resulting in non-sequential positions (e.g., first token can have position 1). Validate that downstream components expect gaps and that scoring/phrase queries in Tantivy behave as desired.

let mut a = TextAnalyzer::builder(SimpleTokenizer::default())
    .filter(RemoveShortFilter::limit(2))
    .build();
let mut token_stream = a.token_stream(text);
let mut tokens: Vec<Token> = vec![];
let mut add_token = |token: &Token| {
    tokens.push(token.clone());
};
token_stream.process(&mut add_token);
tokens
}

#[test]
fn test_remove_short() {
    let tokens = token_stream_helper("i am a hello tantivy, happy searching!");
    assert_eq!(tokens.len(), 5);
    assert_token(&tokens[0], 1, "am", 2, 4);
    assert_token(&tokens[1], 3, "hello", 7, 12);
    assert_token(&tokens[2], 4, "tantivy", 13, 20);
    assert_token(&tokens[3], 5, "happy", 22, 27);
    assert_token(&tokens[4], 6, "searching", 28, 37);
Base64 Threshold

MIN_BASE64_LENGTH tied to MAX_TOKEN_LENGTH+1 drastically lowers the prior 1024 threshold; verify this doesn’t reintroduce noisy base64-like tokens and aligns with performance goals.

fn looks_like_base64(s: &str) -> bool {
    // Base64 characteristics:
    // 1. Only contains [A-Za-z0-9], the tokenizer will remove all non-alphanumeric characters
    // 2. Should be reasonably long (to avoid false positives)
    const MIN_BASE64_LENGTH: usize = super::MAX_TOKEN_LENGTH + 1; // Adjust this threshold as needed
    if s.len() < MIN_BASE64_LENGTH {
        return false;

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Enforce min-max consistency

Ensure min_token_length never exceeds max_token_length; otherwise all tokens could
be dropped. After clamping individual values, enforce min <= max by adjusting one
side if needed.

src/config/src/utils/tantivy/tokenizer/mod.rs [30-33]

-let min_token_length =
-    std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let mut min_token_length = cfg
+    .limit
+    .inverted_index_min_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+let mut max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+if min_token_length > max_token_length {
+    min_token_length = max_token_length;
+}
Suggestion importance[1-10]: 8


Why: Ensuring min_token_length <= max_token_length prevents pathological configurations that could drop all tokens. The improved code accurately derives from the new hunk and meaningfully hardens behavior.

Medium
Clamp token length bounds

Using max here ignores smaller configured limits and can allow overly long tokens.
Clamp the configured value between sensible min and max bounds, and use it directly
without inflating beyond MAX_TOKEN_LENGTH.

src/config/src/utils/tantivy/tokenizer/mod.rs [33]

-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
Suggestion importance[1-10]: 7


Why: The existing code uses max with MAX_TOKEN_LENGTH, which can exceed the intended upper bound; clamping to [MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH] is a reasonable fix. It improves correctness by honoring configured limits without allowing overly long tokens, though it's not critical.

Medium
General
Count Unicode characters

token.text.len() measures bytes, not characters, so multibyte UTF-8 tokens may be
misclassified. Use chars().count() to compare character length consistently with the
filter's description.

src/config/src/utils/tantivy/tokenizer/remove_short.rs [21-23]

 impl<T> RemoveShortFilterStream<T> {
     fn predicate(&self, token: &Token) -> bool {
-        token.text.len() >= self.token_length_limit
+        token.text.chars().count() >= self.token_length_limit
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion aligns with the filter's description about UTF-8 but trades performance for semantic accuracy; depending on intent (bytes vs chars), this may or may not be desirable. It's a valid minor improvement with limited impact.

Low
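A tiny self-contained illustration of the byte-versus-character distinction behind this suggestion (sample string arbitrary):

fn main() {
    let token = "héllo"; // 'é' takes 2 bytes in UTF-8
    assert_eq!(token.len(), 6);           // byte length
    assert_eq!(token.chars().count(), 5); // character count
    // With a minimum length of 6, len() would keep "héllo" but
    // chars().count() would drop it.
}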

@greptile-apps[bot] (Contributor) left a comment
Greptile Summary

This PR introduces configurable token length filtering for Tantivy's inverted index to resolve critical memory issues that were causing panics when processing large parquet files. The core problem was that Tantivy uses int32 for memory addressing, limiting data pages to 4GB, but OpenObserve generates one segment file per parquet file. When processing very large parquet files, the excessive number of tokens (including single characters and very long strings) would cause Tantivy segments to exceed this 4GB limit, resulting in panics at memory_arena.rs:205:9.

The solution implements a two-pronged token filtering approach:

New Environment Variables:

  • ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2 - Filters out tokens shorter than 2 characters
  • ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64 - Filters out tokens longer than 64 characters (previously only long token filtering existed)

Key Changes:

  1. Token Filtering: A new RemoveShortFilter is added in src/config/src/utils/tantivy/tokenizer/remove_short.rs that complements the existing RemoveLongFilter. Both filters are now applied in the tokenizer pipeline (a sketch of such a filter appears after this list).

  2. Configuration: The config system now validates and sets sensible defaults for token length limits, ensuring minimum=2 and maximum=64 when values are unset.

  3. Tokenizer Integration: The o2_tokenizer_build() function in mod.rs now applies both min and max token length filters to both the SimpleTokenizer and O2Tokenizer paths.

  4. Base64 Detection Threshold: The O2Tokenizer's base64 detection logic is updated to use the configurable max token length (MAX_TOKEN_LENGTH + 1 = 65) instead of a hardcoded 1024, creating more consistent behavior.

  5. File Name Generation: Several files now use ider::generate_file_name() instead of ider::generate() to provide enhanced uniqueness with a 4-character hex suffix, supporting the improved token processing workflow.
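As referenced in item 1, here is a sketch of what such a short-token filter can look like, modeled on tantivy's own RemoveLongFilter and its 0.22-style TokenFilter API; the PR's actual remove_short.rs may differ in details:

use tantivy::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};

// Drops tokens *shorter* than a limit, mirroring tantivy's RemoveLongFilter.
#[derive(Clone)]
pub struct RemoveShortFilter {
    length_limit: usize,
}

impl RemoveShortFilter {
    // Keeps tokens whose text is at least `length_limit` bytes long.
    pub fn limit(length_limit: usize) -> Self {
        Self { length_limit }
    }
}

impl TokenFilter for RemoveShortFilter {
    type Tokenizer<T: Tokenizer> = RemoveShortFilterWrapper<T>;

    fn transform<T: Tokenizer>(self, tokenizer: T) -> Self::Tokenizer<T> {
        RemoveShortFilterWrapper {
            length_limit: self.length_limit,
            inner: tokenizer,
        }
    }
}

#[derive(Clone)]
pub struct RemoveShortFilterWrapper<T: Tokenizer> {
    length_limit: usize,
    inner: T,
}

impl<T: Tokenizer> Tokenizer for RemoveShortFilterWrapper<T> {
    type TokenStream<'a> = RemoveShortFilterStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        RemoveShortFilterStream {
            token_length_limit: self.length_limit,
            tail: self.inner.token_stream(text),
        }
    }
}

pub struct RemoveShortFilterStream<T> {
    token_length_limit: usize,
    tail: T,
}

impl<T> RemoveShortFilterStream<T> {
    // Same predicate as shown in the review comments above (byte length).
    fn predicate(&self, token: &Token) -> bool {
        token.text.len() >= self.token_length_limit
    }
}

impl<T: TokenStream> TokenStream for RemoveShortFilterStream<T> {
    fn advance(&mut self) -> bool {
        // Skip tokens that fail the predicate; positions of kept tokens are
        // preserved, which is why the test above shows position gaps.
        while self.tail.advance() {
            if self.predicate(self.tail.token()) {
                return true;
            }
        }
        false
    }

    fn token(&self) -> &Token {
        self.tail.token()
    }

    fn token_mut(&mut self) -> &mut Token {
        self.tail.token_mut()
    }
}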

With these changes, OpenObserve can successfully process 5GB parquet files without hitting Tantivy's memory limits, significantly improving the system's ability to handle large datasets while maintaining search functionality.

Confidence score: 3/5

  • This PR addresses a critical production issue but has some implementation inconsistencies that need attention
  • Score reflects the importance of the fix but concerns about inconsistent tokenization behavior between different code paths
  • Pay close attention to the tokenizer module files, especially the inconsistency between the o2_tokenizer_build and o2_collect_tokens functions

7 files reviewed, 5 comments


@hengfeiyang merged commit 890a8d4 into main on Sep 6, 2025
28 of 30 checks passed
@hengfeiyang deleted the feat/tantivy-tokens branch on September 6, 2025 at 14:28

Reviewers

@uddhavdave approved these changes

@haohuaijin approved these changes

+1 more reviewer

@greptile-apps[bot] left review comments

